GenotypeGVCFs error IndexOutOfBoundsException
AnsweredI try to perform joint genotyping with genotypeGVCF.
Before i run haplotypecaller then Genomicsdbiimport and now genotypesGVCF.
I used gatk 4.2.2.0.
I run Genomicsdbiimport by interval so i would like to perform genotypesGVCF by interval.
the command use for genotypesgvcf is :
gatk \
--java-options "-Xmx${memory_java}M -Xms${memory_java}M -XX:ParallelGCThreads=${SLURM_CPUS_PER_TASK}" \
GenotypeGVCFs \
-R ${REF_Genome} \
-V gendb://${vcf_database_tmp} \
-O ${TMP_DIR}/gentaumix_interval_${SLURM_ARRAY_TASK_ID}_raw.vcf.gz \
-D ${DBSNP} \
--sequence-dictionary ${Dict} \
-L ${Interval} \
-G StandardAnnotation -G AS_StandardAnnotation \
--only-output-calls-starting-in-intervals \
--merge-input-intervals \
2> ${log_DIR}/Interval_${SLURM_ARRAY_TASK_ID}
And the log (for one interval but it's the same for all):
Using GATK jar /shared/ifbstor1/projects/gentaumix/conda/envs/gatk_4.2.2.0/share/gatk4-4.2.2.0-0/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx7000M -Xms7000M -XX:ParallelGCThreads=2 -jar /shared/ifbstor1/projects/gentaumix/conda/envs/gatk_4.2.2.0/share/gatk4-4.2.2.0-0/gatk-package-4.2.2.0-local.jar GenotypeGVCFs -R /shared/projects/gentaumix/Ressources/grch38_BWA_2/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa -V gendb:///tmp/tmp.6QEyWPGpWs/vcf_database/Interval_6 -O /tmp/tmp.6QEyWPGpWs/gentaumix_interval_6_raw.vcf.gz -D /shared/projects/gentaumix/Ressources/known_sites/Homo_sapiens_assembly38.dbsnp138.vcf --sequence-dictionary /shared/projects/gentaumix/Ressources/known_sites/Homo_sapiens_assembly38.dict -L /shared/projects/gentaumix/Ressources/interval_genomicsdbi/temp_6/interval.interval_list -G StandardAnnotation -G AS_StandardAnnotation --merge-input-intervals
14:17:35.171 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
14:17:35.232 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/shared/ifbstor1/projects/gentaumix/conda/envs/gatk_4.2.2.0/share/gatk4-4.2.2.0-0/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Sep 10, 2021 2:17:35 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:17:35.492 INFO GenotypeGVCFs - ------------------------------------------------------------
14:17:35.492 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.2.0
14:17:35.492 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
14:17:35.493 INFO GenotypeGVCFs - Executing as quentin67100@cpu-node-9 on Linux v3.10.0-1160.6.1.el7.x86_64 amd64
14:17:35.493 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_282-b08
14:17:35.493 INFO GenotypeGVCFs - Start Date/Time: 10 septembre 2021 14:17:35 CEST
14:17:35.493 INFO GenotypeGVCFs - ------------------------------------------------------------
14:17:35.494 INFO GenotypeGVCFs - ------------------------------------------------------------
14:17:35.495 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
14:17:35.495 INFO GenotypeGVCFs - Picard Version: 2.25.4
14:17:35.495 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
14:17:35.495 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:17:35.495 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:17:35.495 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:17:35.495 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:17:35.495 INFO GenotypeGVCFs - Deflater: IntelDeflater
14:17:35.496 INFO GenotypeGVCFs - Inflater: IntelInflater
14:17:35.496 INFO GenotypeGVCFs - GCS max retries/reopens: 20
14:17:35.496 INFO GenotypeGVCFs - Requester pays: disabled
14:17:35.496 INFO GenotypeGVCFs - Initializing engine
14:17:36.493 INFO FeatureManager - Using codec VCFCodec to read file file:///shared/projects/gentaumix/Ressources/known_sites/Homo_sapiens_assembly38.dbsnp138.vcf
14:17:38.548 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.1-d59e886
14:17:39.941 info NativeGenomicsDB - pid=12231 tid=12232 No valid combination operation found for INFO field AS_InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
14:17:39.941 info NativeGenomicsDB - pid=12231 tid=12232 No valid combination operation found for INFO field AS_QD - the field will NOT be part of INFO fields in the generated VCF records
14:17:39.941 info NativeGenomicsDB - pid=12231 tid=12232 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
14:17:39.941 info NativeGenomicsDB - pid=12231 tid=12232 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
14:17:39.941 info NativeGenomicsDB - pid=12231 tid=12232 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
14:17:41.513 INFO FeatureManager - Using codec IntervalListCodec to read file file:///shared/projects/gentaumix/Ressources/interval_genomicsdbi/temp_6/interval.interval_list
14:17:41.628 INFO IntervalArgumentCollection - Processing 62000000 bp from intervals
14:17:41.743 INFO GenotypeGVCFs - Done initializing engine
14:17:41.897 INFO ProgressMeter - Starting traversal
14:17:41.898 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
14:17:52.020 INFO ProgressMeter - chr2:60009645 0.2 9000 53349.1
14:18:02.071 INFO ProgressMeter - chr2:60039632 0.3 37000 110048.1
14:18:02.683 INFO GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),5.257768857000008,Cpu time(s),5.221303873999989
[10 septembre 2021 14:18:02 CEST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.46 minutes.
Runtime.totalMemory()=7034372096
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at org.broadinstitute.hellbender.tools.walkers.annotator.allelespecific.StrandBiasUtils.combineAttributeMap(StrandBiasUtils.java:120)
at org.broadinstitute.hellbender.tools.walkers.annotator.allelespecific.AS_StrandBiasTest.combineRawData(AS_StrandBiasTest.java:118)
at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.combineAnnotations(VariantAnnotatorEngine.java:234)
at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.mergeAttributes(ReferenceConfidenceVariantContextMerger.java:318)
at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:142)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:130)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:281)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:490)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
For information the command used for haplotype caller and genomicsdbiimport are :
srun --ntasks=1 gatk --java-options "-Xmx${SLURM_MEM_PER_CPU}M" HaplotypeCaller \
-R ${REF_Genome} \
-L ${Scattered_DIR}/temp_${i}_of_${SCATTER_COUNT}/scattered.interval_list \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT} \
-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
-GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
-ERC GVCF \
--pcr-indel-model NONE \
2> ${logs_HC}/${i}.${GVCF_OUTPUT}.log &
and
gatk --java-options "-Xmx${memory_java}M -Xms${memory_java}M -XX:ParallelGCThreads=${SLURM_CPUS_PER_TASK}" GenomicsDBImport \
--sample-name-map ${cohort_map} \
--genomicsdb-workspace-path ${VCF_database_DIR_tmp}/Interval_${SLURM_ARRAY_TASK_ID} \
--batch-size 74 \
--reader-threads ${SLURM_CPUS_PER_TASK} \
-L ${Interval} \
--tmp-dir ${TMP_DIR} \
--genomicsdb-shared-posixfs-optimizations \
2> ${logs_DIR}/Interval_${SLURM_ARRAY_TASK_ID}.log
-
We released a new GATK version 4.2.3.0 containing a patch that we believe will solve your issue. Could you test out the newest GATK version and let us know if it solves the issue?
Best,
Genevieve
-
Thanks for posting this on the forum! This does look like a bug, so I have created an issue ticket so that the developers can take a look and make changes to the tools. Do you know if this issue occurs with GATK 4.2.1.0?
Best,
Genevieve
-
I don't try on 4.2 1.0 but it's work on 4.2.0.0 even if this, to my opinion, not a good deal to run all on gatk 4.2.2.0 eexcept genotypegvcf...
-
Yes, we definitely want to fix this issue! I am asking so that we can narrow in on what change caused this issue.
-
Hi! I have the same issue with versions 4.2.2.0 & 4.2.1.0. Works fine with 4.2.0.0. For the prior steps (preprocessing, haplotypecaller & genomicsDBImport) I used the latest (4.2.2.0).
-
Karoliina Salenius thank you for providing extra information!
We are looking into what is causing this bug and are wondering if the java.lang.IndexOutOfBoundsException comes up when certain annotations are missing.
Could you paste a few lines from your VCF using SelectVariants, Karoliina Salenius or Quentin Chartreux?
Thank you!
-
I don't know what is useful for you so this is first the begining of the header of the single sample gvcf:
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not int
ended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --gvcf-gq-bands 10 --gvcf-gq-bands 20 --gvcf-gq-bands 30 --gvcf-gq-bands 40 --gvcf-gq-bands 50 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq
-bands 80 --gvcf-gq-bands 90 --pcr-indel-model NONE --emit-ref-confidence GVCF --output /shared/projects/gentaumix/processed_data/05_VCF/A_gVCF/G632_DA_C001KVY_1_HYH2CDSXX.DUAL2.trimmed.align.filtered.dedup.fi
xed.recal.bam/temp_gVCF/0011.G632_DA_C001KVY_1_HYH2CDSXX.DUAL2.trimmed.align.filtered.dedup.fixed.recal.g.vcf.gz --intervals /shared/projects/gentaumix/processed_data/05_VCF/A_gVCF/G632_DA_C001KVY_1_HYH2CDSXX.
DUAL2.trimmed.align.filtered.dedup.fixed.recal.bam/interval/temp_0011_of_50/scattered.interval_list --input /shared/projects/gentaumix/processed_data/03_Alignment/D_recal/G632_DA_C001KVY_1_HYH2CDSXX.DUAL2.trim
med.align.filtered.dedup.fixed.recal.bam --reference /shared/projects/gentaumix/Ressources/grch38_BWA_2/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa --annotation-group StandardAnnotation --annota
tion-group AS_StandardAnnotation --annotation-group StandardHCAnnotation --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-a
lleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-
ploidy 2 --num-reference-samples-if-no-call 0 --genotype-assignment-method USE_PLS_TO_ASSIGN --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --floor-blocks false --
indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --dragen-mode false --apply-bqd false --apply-frd false --disable-spanning-event-genotyping false --transform-dragen-mapping-quality false
--mapping-quality-threshold-for-genotyping 20 --max-effective-depth-adjustment-for-frd 0 --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlap
ping-quality false --use-filtered-reads-for-annotations false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer
-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2
--adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-art
ificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end
-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --dragst
r-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --phred-sca
led-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqual
ification-threshold 1.0 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-wate
rman JAVA --max-mnp-distance 0 --force-call-filtered-alleles false --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --min-assembly-region-size 50 --max-assembly-region-size 300 --
active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-exten
sion-into-assembly-region-padding-legacy 25 --max-reads-per-alignment-start 50 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --int
erval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --
create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40
--cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jd
k-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations
false --allow-old-rms-mapping-quality-annotation-data false",Version="4.2.2.0",Date="4 septembre 2021 07:24:25 CEST">
##GVCFBlock0-10=minGQ=0(inclusive),maxGQ=10(exclusive)
##GVCFBlock10-20=minGQ=10(inclusive),maxGQ=20(exclusive)
##GVCFBlock20-30=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40-50=minGQ=40(inclusive),maxGQ=50(exclusive)
##GVCFBlock50-60=minGQ=50(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-100=minGQ=90(inclusive),maxGQ=100(exclusive)
##INFO=<ID=AS_InbreedingCoeff,Number=A,Type=Float,Description="Allele-specific inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"
>
##INFO=<ID=AS_QD,Number=A,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQ,Number=1,Type=String,Description="Allele-specfic raw data for RMS Mapping Quality">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specfic raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>Then there is the rest of the contig.
Exemple of one variant from the gvcf:
chr1 59193 . T G,<NON_REF> 380.64 . AS_RAW_BaseQRankSum=|-2.0,1|NaN;AS_RAW_MQ=39500.00|19339.00|0.00;AS_RAW_MQRankSum=|-0.7,1|NaN;AS_RAW_ReadPosRankSum=|0.8,1|NaN;AS_SB_TABLE=15,17|10,6|0,0;BaseQRankSum=-1.964;DP=48;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.645;RAW_MQandDP=58839,48;ReadPosRankSum=0.810 GT:AD:DP:GQ:PL:SB 0/1:32,16,0:48:99:388,0,930,484,978,1463:15,17,10,6
here is the begining of the header of the multi sample vcf obtained from the genomicsdb with selectvariant :
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=GenomicsDBImport,CommandLine="GenomicsDBImport --genomicsdb-workspace-path /tmp/tmp.PKmeTjbuoM/vcf_database/Interval_1 --batch-size 74 --sample-name-map /shared/projects/gentaumix/processed_data/05_VCF/A_gVCF/gentaumix_cohort.sample_map --reader-threads 2 --genomicsdb-shared-posixfs-optimizations true --intervals /shared/projects/gentaumix/Ressources/interval_genomicsdbi/temp_1/interval.interval_list --tmp-dir /tmp/tmp.PKmeTjbuoM --genomicsdb-segment-size 1048576 --genomicsdb-vcf-buffer-size 16384 --overwrite-existing-genomicsdb-workspace false --consolidate false --validate-sample-name-map false --merge-input-intervals false --max-num-intervals-to-import-in-parallel 1 --merge-contigs-into-num-partitions 0 --genomicsdb-use-gcs-hdfs-connector false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.2.2.0",Date="10 septembre 2021 09:28:45 CEST">
##GATKCommandLine=<ID=SelectVariants,CommandLine="SelectVariants --output /shared/projects/gentaumix/processed_data/script/script_test/selectvariant/interval1.vcf.gz --variant gendb:///shared/projects/gentaumix/processed_data/05_VCF/B_vcf_database/Interval_1 --reference /shared/projects/gentaumix/Ressources/grch38_BWA_2/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa --invertSelect false --exclude-non-variants false --exclude-filtered false --preserve-alleles false --remove-unused-alternates false --restrict-alleles-to ALL --keep-original-ac false --keep-original-dp false --mendelian-violation false --invert-mendelian-violation false --mendelian-violation-qual-threshold 0.0 --select-random-fraction 0.0 --remove-fraction-genotypes 0.0 --fully-decode false --max-indel-size 2147483647 --min-indel-size 0 --max-filtered-genotypes 2147483647 --min-filtered-genotypes 0 --max-fraction-filtered-genotypes 1.0 --min-fraction-filtered-genotypes 0.0 --max-nocall-number 2147483647 --max-nocall-fraction 1.0 --set-filtered-gt-to-nocall false --allow-nonoverlapping-command-line-samples false --suppress-reference-path false --genomicsdb-use-bcf-codec false --genomicsdb-shared-posixfs-optimizations false --genomicsdb-use-gcs-hdfs-connector false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.2.2.0",Date="21 septembre 2021 10:37:20 CEST">
##GVCFBlock0-10=minGQ=0(inclusive),maxGQ=10(exclusive)
##GVCFBlock10-20=minGQ=10(inclusive),maxGQ=20(exclusive)
##GVCFBlock20-30=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40-50=minGQ=40(inclusive),maxGQ=50(exclusive)
##GVCFBlock50-60=minGQ=50(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-100=minGQ=90(inclusive),maxGQ=100(exclusive)
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AS_InbreedingCoeff,Number=A,Type=Float,Description="Allele-specific inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=AS_QD,Number=A,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQ,Number=1,Type=String,Description="Allele-specfic raw data for RMS Mapping Quality">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specfic raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=chr1,length=248956422,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>
##contig=<ID=chr2,length=242193529,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>
##contig=<ID=chr3,length=198295559,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>
##contig=<ID=chr4,length=190214555,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>
##contig=<ID=chr5,length=181538259,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>
##contig=<ID=chr6,length=170805979,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>
##contig=<ID=chr7,length=159345973,assembly=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa>and a variant:
chr1 13868 . A G,<NON_REF> . . AS_RAW_BaseQRankSum=|-4.000,1,-3.000,14,-2.000,19,-1.000,10,-0.000,1,1.000,2,4.000,1|;AS_RAW_MQ=466387.000|245200.000|0.000;AS_RAW_MQRankSum=|-1.000,8,-0.000,17,1.000,9,2.000,12,3.000,2|;AS_RAW_ReadPosRankSum=|-1.000,3,0.000,30,1.000,13,2.000,2|;AS_SB_TABLE=117,611|53,338|0,0;BaseQRankSum=-2.483;DP=1670;ExcessHet=3.0103;MQRankSum=0.917;RAW_MQandDP=724138,1141;ReadPosRankSum=0.429 GT:AD:GQ:MIN_DP:PL:SB:DP ./.:.:81:28:0,81,1215,81,1215,1215:.:30 ./.:.:90:32:0,90,1247,90,1247,1247:.:40 ./.:.:30:11:0,30,450,30,450,450:.:14 ./.:22,4,0:38:.:38,0,523,104,535,639:2,20,0,4:26 ./.:.:81:30:0,81,1215,81,1215,1215:.:30 ./.:11,4,0:70:.:70,0,285,103,297,400:1,10,0,4:15 ./.:15,11,0:99:.:236,0,344,281,377,658:0,15,3,8:26 ./.:.:90:32:0,90,1350,90,1350,1350:.:37 ./.:9,7,0:99:.:165,0,209,192,230,422:0,9,0,7:16 ./.:.:72:26:0,72,992,72,992,992:.:29 ./.:5,11,0:89:.:266,0,89,281,122,403:0,5,0,11:16 ./.:.:21:7:0,21,253,21,253,253:.:10 ./.:.:93:33:0,93,1317,93,1317,1317:.:34 ./.:9,12,0:99:.:288,0,182,315,218,533:0,9,2,10:21 ./.:11,8,0:99:.:196,0,251,229,275,504:0,11,0,8:19 ./.:4,9,0:71:.:222,0,71,234,98,332:0,4,1,8:13 ./.:.:72:24:0,72,923,72,923,923:.:26 ./.:16,6,0:99:.:104,0,390,152,408,560:3,13,1,5:22 ./.:8,2,0:26:.:26,0,195,50,201,251:0,8,1,1:10 ./.:.:12:6:0,12,180,12,180,180:.:6 ./.:.:0:0:0,0,0,0,0,0:.:0 ./.:.:42:14:0,42,480,42,480,480:.:16 ./.:10,14,0:99:.:350,0,208,381,250,631:0,10,0,14:24 ./.:7,2,0:29:.:29,0,168,50,174,225:0,7,1,1:9 ./.:6,5,0:99:.:115,0,134,133,149,282:0,6,0,5:11 ./.:17,2,0:0:.:0,0,463,51,469,520:8,9,0,2:19 ./.:.:51:19:0,51,643,51,643,643:.:19 ./.:.:81:29:0,81,1215,81,1215,1215:.:30 ./.:.:18:7:0,18,270,18,270,270:.:8 ./.:18,4,0:48:.:48,0,447,102,459,561:6,12,0,4:22 ./.:5,21,0:67:.:536,0,67,551,130,681:0,5,3,18:26 ./.:4,3,0:67:.:67,0,86,79,95,174:0,4,0,3:7 ./.:.:42:15:0,42,630,42,630,630:.:17 ./.:41,6,0:33:.:33,0,1077,155,1095,1251:13,28,0,6:47 ./.:.:84:34:0,84,1260,84,1260,1260:.:34 ./.:21,9,0:99:.:166,0,528,229,555,784:4,17,0,9:30 ./.:39,7,0:63:.:63,0,983,179,1004,1184:8,31,1,6:46 ./.:11,7,0:99:.:147,0,232,179,253,433:0,11,2,5:18 ./.:.:63:22:0,63,861,63,861,861:.:22 ./.:31,6,0:58:.:58,0,771,150,789,940:6,25,2,4:37 ./.:36,3,0:27:.:0,27,969,107,978,1059:11,25,0,3:39 ./.:14,7,0:99:.:130,0,347,172,368,541:0,14,1,6:21 ./.:3,9,0:44:.:219,0,44,228,71,299:0,3,1,8:12 ./.:.:42:14:0,42,537,42,537,537:.:17 ./.:.:93:31:0,93,1115,93,1115,1115:.:33 ./.:27,5,0:53:.:53,0,690,134,705,839:2,25,0,5:32 ./.:13,4,0:63:.:63,0,343,102,355,457:0,13,1,3:17
./.:6,6,0:99:.:136,0,129,154,147,302:0,6,2,4:12 ./.:5,29,0:41:.:717,0,41,732,128,859:0,5,5,24:34 ./.:.:90:33:0,90,1185,90,1185,1185:.:35 ./.:21,19,0:99:.:442,0,439,504,496,1000:0,21,4,15:40 ./.:.:12:5:0,12,180,12,180,180:.:6 ./.:36,9,0:99:.:130,0,903,238,930,1168:11,25,2,7:45 ./.:11,8,0:99:.:178,0,238,211,262,473:0,11,1,7:19 ./.:12,15,0:99:.:349,0,254,385,299,684:0,12,2,13:27 ./.:2,5,0:34:.:118,0,34,124,49,173:0,2,1,4:7 ./.:8,12,0:99:.:281,0,166,305,202,507:2,6,4,8:20 ./.:25,5,0:55:.:55,0,625,130,640,769:2,23,0,5:30 ./.:19,22,0:99:.:511,0,436,568,502,1070:3,16,5,17:41 ./.:4,8,0:57:.:182,0,57,193,80,273:0,4,1,7:12 ./.:12,4,0:68:.:68,0,289,104,301,405:0,12,0,4:16 ./.:1,6,0:7:.:160,0,7,164,25,188:0,1,2,4:7 ./.:11,2,0:19:.:19,0,253,51,259,311:1,10,0,2:13 ./.:17,7,0:99:.:123,0,413,174,434,608:0,17,2,5:24 ./.:5,9,0:94:.:233,0,94,248,121,368:0,5,0,9:14 ./.:.:60:22:0,60,900,60,900,900:.:23 ./.:39,11,0:99:.:161,0,999,278,1031,1310:17,22,0,11:50 ./.:15,6,0:99:.:103,0,363,148,381,530:1,14,1,5:21 ./.:34,6,0:55:.:55,0,851,157,869,1025:11,23,0,6:40 ./.:.:30:10:0,30,378,30,378,378:.:12 ./.:.:42:14:0,42,475,42,475,475:.:14 ./.:21,6,0:91:.:91,0,545,154,563,717:5,16,0,6:27 ./.:11,8,0:99:.:178,0,252,211,276,487:0,11,1,7:19 ./.:.:90:31:0,90,1350,90,1350,1350:.:32and for info in the log of the selectvariant i get this :
10:37:16.657 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.1-d59e886
10:37:18.433 info NativeGenomicsDB - pid=13958 tid=13959 No valid combination operation found for INFO field AS_InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
10:37:18.434 info NativeGenomicsDB - pid=13958 tid=13959 No valid combination operation found for INFO field AS_QD - the field will NOT be part of INFO fields in the generated VCF records
10:37:18.434 info NativeGenomicsDB - pid=13958 tid=13959 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
10:37:18.434 info NativeGenomicsDB - pid=13958 tid=13959 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
10:37:18.434 info NativeGenomicsDB - pid=13958 tid=13959 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records -
Thank you so much Quentin Chartreux! It looks like the lines you shared with me contain the AS_SB_TABLE, which is the annotation in question. Could you double check that the variant records all contain this annotation using grep -v for AS_SB_TABLE? If any variant lines are output, then they do not contain that annotation.
-
Hi, can you please elaborate what is the issue when the variants don't contain the AS_SB_TABLE annotation. At which step would this occur and how to fix it? I'm following the allele-specific annotation and filtering of germline short variants and although the GenotypeGVCFs would run with the 4.2.0.0 -version, there were some issues with too many genotypes:
Chromosome chr19 position 1654996 (TileDB column 2656066283) has too many genotypes in the combined VCF record : 1128 : current limit : 1024 (num_alleles, ploidy) = (47, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Is this related to the variants missing the AS_SB_TABLE annotations or a whole different issue altogether? Thanks!
-
Genevieve Brandt (she/her) for which file ?
The raw gvcf (the output of gatk 4.2.2.0 haplotypecaller) didn't contain any Line without these annotations.
The vcf (output of gatk 4.2.0.0 genotypegvcf) contain a lot, if not all, of variants without these annotation.
-
Karoliina Salenius I think the issue with too many genotypes is an unrelated warning. You can find some forum posts from other users with the same issue for more information. I am still diagnosing the issue causing the java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 so I was just checking if there was something strange with that annotation.
Quentin Chartreux I was wondering about the raw GVCF, thank you for checking.
I'll follow up again when I have more information.
-
Hi Quentin Chartreux, could you upload a test GVCF so that we can look into the AS_SB_TABLE more closely? The last position in the progress meter before the exception in GenotypeGVCFs is chr2:60039632. So, if you can find a subset of the file in that region that replicates this issue, that would be ideal. Here are the instructions for uploading bug reports: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671
Thank you!
-
I just tried. It seems to be working. I just had an error on 2 interval of type out of memory so I will try to increase the memory (currently I do it in parallel on 50 intervals and I put 5 GB per interval for the job including 4 GB for the java Xmx and Xms)
-
Ok, let me know if you have more issues. For now I'll mark it as solved.
Please sign in to leave a comment.
14 comments