How to use the --alleles option of HaplotypeCaller ?
Hello!
We are trying to use the --alleles option of HaplotypeCaller.
Our goal :
- we have a set of goldenSNPs
- we have a sample, and we want to genotype this sample ONLY at the goldenSNPs positions.
We want as an output a vcf file with all (and only) the positions of the goldenSNPs and all relevant information from the sample (DP, AD, AF, GT, GQ and so on).
Reading the documentation, we understood that the --alleles option of HaplotypeCaller would be perfect for this.
Question 1: is that indeed the way to go ?
Question 2: we do not manange to run HaplotypeCaller with the --alleles option, and we get the error message "Cannot read file:///golden_snps.vcf because no suitable codecs found" (see full trace below).
Thank you for your help,
Chloé
REQUIRED for all errors and issues:
a) GATK version used: 4.2.2
b) Exact command used:
gatk422 HaplotypeCaller \
--native-pair-hmm-threads 8 \
-R TAIR10_chr_all.fasta \
-I sample111355A.bam \
-O sample111355A_genotyped.vcf \
--alleles golden_snps.vcf \
-L golden_snps.vcf \
--output-mode EMIT_ALL_CONFIDENT_SITES
c) Entire program log:
node17
WARNING: Could not find any nv files on this host!
Using GATK jar /gatk/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.2.2.0-local.jar HaplotypeCaller --native-pair-hmm-threads 8 -R TAIR10_chr_all.fasta -I sample111355A.bam -O sample111355A_genotyped.vcf --alleles golden_snps.vcf -L golden_snps.vcf --output-mode EMIT_ALL_CONFIDENT_SITES
16:37:55.653 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 10, 2023 4:37:56 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:37:56.293 INFO HaplotypeCaller - ------------------------------------------------------------
16:37:56.293 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.2.0
16:37:56.294 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:37:56.294 INFO HaplotypeCaller - Executing as chloe.girard@node17 on Linux v5.10.0-21-amd64 amd64
16:37:56.294 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
16:37:56.295 INFO HaplotypeCaller - Start Date/Time: November 10, 2023 4:37:55 PM GMT
16:37:56.295 INFO HaplotypeCaller - ------------------------------------------------------------
16:37:56.295 INFO HaplotypeCaller - ------------------------------------------------------------
16:37:56.296 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
16:37:56.296 INFO HaplotypeCaller - Picard Version: 2.25.4
16:37:56.296 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
16:37:56.296 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:37:56.296 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:37:56.296 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:37:56.296 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:37:56.296 INFO HaplotypeCaller - Deflater: IntelDeflater
16:37:56.296 INFO HaplotypeCaller - Inflater: IntelInflater
16:37:56.297 INFO HaplotypeCaller - GCS max retries/reopens: 20
16:37:56.297 INFO HaplotypeCaller - Requester pays: disabled
16:37:56.297 INFO HaplotypeCaller - Initializing engine
16:37:56.877 INFO HaplotypeCaller - Shutting down engine
[November 10, 2023 4:37:56 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=1548222464
***********************************************************************
A USER ERROR has occurred: Cannot read file:///golden_snps.vcf because no suitable codecs found
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Thank you!
-
Additional note (sorry!):
The golden_snps.vcf was generated by us, and does not have a header. It looks like this#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 139_genotyped_Indiv_28
Chr1 346 Chr1_346 C T PASS . NA NA NA
Chr1 502 Chr1_502 T C PASS . NA NA NA
Chr1 508 Chr1_508 T C PASS . NA NA NA
Chr1 657 Chr1_657 C T PASS . NA NA NA
Chr1 730 Chr1_730 G A PASS . NA NA NA
Chr1 754 Chr1_754 C T PASS . NA NA NA
Chr1 755 Chr1_755 T C PASS . NA NA NA
Chr1 864 Chr1_864 C T PASS . NA NA NA
Chr1 892 Chr1_892 T G PASS . NA NA NA
Chr1 1260 Chr1_1260 T A PASS . NA NA NA
Chr1 2309 Chr1_2309 T A PASS . NA NA NA
Chr1 2310 Chr1_2310 C A PASS . NA NA NA
Chr1 3102 Chr1_3102 A G PASS . NA NA NA
Chr1 4648 Chr1_4648 C A PASS . NA NA NA
Chr1 6059 Chr1_6059 T C PASS . NA NA NA
Chr1 6514 Chr1_6514 T C PASS . NA NA NA
Chr1 6603 Chr1_6603 T C PASS . NA NA NA
Chr1 8193 Chr1_8193 G A PASS . NA NA NA
Chr1 8780 Chr1_8780 G A PASS . NA NA NA
Chr1 9250 Chr1_9250 T A PASS . NA NA NA
Chr1 9498 Chr1_9498 C A PASS . NA NA NA
Chr1 9846 Chr1_9846 G A PASS . NA NA NA
Chr1 10360 Chr1_10360 T C PASS . NA NA NA
Chr1 11493 Chr1_11493 G A PASS . NA NA NA
Chr1 12584 Chr1_12584 A C PASS . NA NA NA -
Hi Chloé Girard
Yes this is the way to force genotype known alleles from a VCF file. GATK is quite strict in conforming HTS standards therefore a VCF file header and an index file is necessary for GATK to read through VCF files.
A generic header can be generated using the below line
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="Variants Passing All Filters">Your file seems to contain a FORMAT field and a sample field however those fields cannot be filled with ambiguous objects therefore it is better you remove any columns after INFO field. A simple header can be added to your VCF file by using
bcftools reheader
tool. However GATK will most likely to ask you a compatible sequence dictionary is needed in the VCF therefore once you add your first header line to your VCF file you may need to use
gatk UpdateVCFSequenceDictionary
tool and a compatible sequence dictionary to that VCF alleles.
Finally using
gatk IndexFeatureFile
tool will index your VCF file therefore it will be ready to use by HaplotypeCaller --alleles parameter.
Regards.
-
Hello!
Thank you for your quick and thorough answer. Unfortunately, we do not manage to generate a header.
We followed the indicated steps:
1. We added##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="Variants Passing All Filters">to the header of our file.
The full header now reads;##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="Variants Passing All Filters">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 139_genotyped_Indiv_28Step 2. We removed all columns after the INFO field (using R, and saving the file as a tabulated file). The "goldenSNPsimple.vcf" file now reads:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="Variants Passing All Filters">
#CHROM POS ID REF ALT FILTER INFO
Chr1 346 Chr1_346 C T PASS .
Chr1 502 Chr1_502 T C PASS .
Chr1 508 Chr1_508 T C PASS .
Chr1 657 Chr1_657 C T PASS .
Chr1 730 Chr1_730 G A PASS .
Chr1 754 Chr1_754 C T PASS .
Chr1 755 Chr1_755 T C PASS .
Chr1 864 Chr1_864 C T PASS .Step 3: Running bcftools rehaeader command. This is when things went wrong.
bcftools reheader -h goldenSNPs_simple.vcf -o goldenSNPs_simple_header.vcf
The error that we get is as follow:
bcftools: bgzf.c:811: bgzf_read: Assertion `fp->is_write == 0' failed.
/var/spool/pbs/mom_priv/jobs/506945.pbsserver.SC: line 19: 3693648 Aborted bcftools reheader -h goldenSNPs_simple.vcf -o goldenSNPs_simple_header.vcfLooking for solutions:
1. We tried searching for a way around it with picard FixVcfHeaderpicard.jar FixVcfHeader \
-I=goldenSNPs_simple.vcf \
-O=goldenSNPs_simple_header.vcfand we get the following error
Your input file has a malformed header: there are not enough columns present in the header line: #CHROM POS ID REF ALT FILTER INFO
2. We tried adding lines to the VCF as a header manually, creating the "goldenSNPsimple_manual.vcf" file. The file now reads :
##fileformat=VCFv4.1
##source=goldenSNPs_script
##contig=<ID=1,length=30427671>
##contig=<ID=2,length=19698289>
##contig=<ID=3,length=23459830>
##contig=<ID=4,length=18585056>
##contig=<ID=5,length=26975502>
##contig=<ID=mitochondria,length=366924>
##contig=<ID=chloroplast,length=154478>
##FILTER=<ID=PASS,Description="Variants Passing All Filters">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT FILTER INFO
Chr1 346 Chr1_346 C T PASS .
Chr1 502 Chr1_502 T C PASS .
Chr1 508 Chr1_508 T C PASS .
Chr1 657 Chr1_657 C T PASS .
Chr1 730 Chr1_730 G A PASS .
Chr1 754 Chr1_754 C T PASS .
Chr1 755 Chr1_755 T C PASS .
Chr1 864 Chr1_864 C T PASS .
Chr1 892 Chr1_892 T G PASS .
Chr1 1260 Chr1_1260 T A PASS .And then we tried running the gatk UpdateVCFSequenceDictionary command
gatk UpdateVCFSequenceDictionary \
# -V goldenSNPsimple_manual.vcf \
# --source-dictionary TAIR10_chr_all.dict \
# --output goldenSNPsimple_manual_dict.vcfAnd then we get the same error
Your input file has a malformed header: there are not enough columns present in the header line: #CHROM POS ID REF ALT FILTER INFO
Why are there not enough columns ???
If you can help us, we would greatly appreciate it.
Thank you!!!
Chloé -
Update: We solved the first problem by "inventing" columns :)
##fileformat=VCFv4.1
##source=goldenSNPs_script
##contig=<ID=1,length=30427671>
##contig=<ID=2,length=19698289>
##contig=<ID=3,length=23459830>
##contig=<ID=4,length=18585056>
##contig=<ID=5,length=26975502>
##FILTER=<ID=PASS,Description="Variants Passing All Filters">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
Chr1 346 Chr1_346 C T 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 502 Chr1_502 T C 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 508 Chr1_508 T C 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 657 Chr1_657 C T 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 730 Chr1_730 G A 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 754 Chr1_754 C T 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 755 Chr1_755 T C 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 864 Chr1_864 C T 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28
Chr1 892 Chr1_892 T G 1 PASS DP=1 GT:GQ:DP:HQ 0/1:50:2:28We'll keep you updated on the next steps.
-
Hi Chloé Girard
VCF format has immutable columns to be valid for processing. Your file was missing the QUAL field. Anything east of INFO is not necessary for a VCF to be valid (In that case it becomes a sites only VCF).
Yet again your solution is valid and should not cause you any issues if all your known variants are compatible with your reference genome, left aligned and simplified (meaning no MNPs or Complex Substitutions).
Regards.
-
Hello ! It worked!!
Let's recap for future users:
The final vcf was created manually (in R and in a text editor) to look like this (notice that we changed "Chr1" to "1" to match the TAIR10 reference) :##fileformat=VCFv4.1
##source=goldenSNPs_script
##contig=<ID=1,length=30427671>
##contig=<ID=2,length=19698289>
##contig=<ID=3,length=23459830>
##contig=<ID=4,length=18585056>
##contig=<ID=5,length=26975502>
##FILTER=<ID=PASS,Description="Variants Passing All Filters">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 346 . C T 1 PASS .
1 502 . T C 1 PASS .
1 508 . T C 1 PASS .
1 657 . C T 1 PASS .
1 730 . G A 1 PASS .
1 754 . C T 1 PASS .
1 755 . T C 1 PASS .
1 864 . C T 1 PASS .
1 892 . T G 1 PASS .Then we just ran the UpdateVCFSequenceDictionary function
gatk2 UpdateVCFSequenceDictionary \
-V goldenSNPsFORMATTED.vcf \
--source-dictionary TAIR10_chr_all.dict \
--replace true \
--output goldenSNPsFORMATTED_dict.vcfThis created both a VCF and an index.
We then ran HaplotypeCaller with the dict.vcf/opt/singularity/calls_gatk/gatk422 HaplotypeCaller \
--native-pair-hmm-threads 8 \
-R TAIR10_chr_all.fasta \
-I sample.bam \
-O sample_genotypedbygoldenSNPs.vcf \
--alleles goldenSNPsFORMATTED_dict.vcf \
-L goldenSNPsFORMATTED_dict.vcf \
--output-mode EMIT_ALL_CONFIDENT_SITES
and we have our final file!!
EDIT: We do not have the right number of lines. See next post for details.##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --output-mode EMIT_ALL_CONFIDENT_SITES --native-pair-hmm-threads 8 --alleles /store/EQUIPES/MRP/Partage/Arabidopsis_commun/ALL_SCRIPTS_HERE/Git/polyrec/data/data_test/data_polyrec/parental_golden_snps_139_genotyped_Indiv_28_merged_FORMATTED_IDpoint_noChr_dict.vcf --output /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/CO_mapping/Tests/111355_AA_genotypedbygoldenSNPs.vcf --intervals parental_golden_snps_139_genotyped_Indiv_28_merged_FORMATTED_IDpoint_noChr_dict.vcf --input /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/CO_mapping/PROCESSEDDATA/Individus_XXVIII_28/111355_AA/111355_AAmarkdup.bam --reference /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/REF_genomes/TAIR10_chr_all.fasta --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotype-assignment-method USE_PLS_TO_ASSIGN --contamination-fraction-to-filter 0.0 --all-site-pls false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --dragen-mode false --apply-bqd false --apply-frd false --disable-spanning-event-genotyping false --transform-dragen-mapping-quality false --mapping-quality-threshold-for-genotyping 20 --max-effective-depth-adjustment-for-frd 0 --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 0 --force-call-filtered-alleles false --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --max-reads-per-alignment-start 50 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.2.2.0",Date="November 14, 2023 9:34:23 AM GMT">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=1,length=30427671>
##contig=<ID=2,length=19698289>
##contig=<ID=3,length=23459830>
##contig=<ID=4,length=18585056>
##contig=<ID=5,length=26975502>
##contig=<ID=mitochondria,length=366924>
##contig=<ID=chloroplast,length=154478>
##source=HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 111355_AA
1 346 . C T 33.36 . AC=0;AF=0.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.693 GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,78
1 502 . T C 48.77 . AC=0;AF=0.00;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223 GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,243
1 508 . T C 45.74 . AC=0;AF=0.00;AN=2;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.307 GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,205
1 657 . C T 45.74 . AC=0;AF=0.00;AN=2;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.693 GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,233
1 730 . G A 42.62 . AC=0;AF=0.00;AN=2;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.446 GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,165
1 754 . C T 39.73 . AC=0;AF=0.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223 GT:AD:DP:GQ:PL 0/0:4,0:4:12:0,12,155
1 755 . T C 39.73 . AC=0;AF=0.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223 GT:AD:DP:GQ:PL 0/0:4,0:4:12:0,12,155
1 864 . C T 33.36 . AC=0;AF=0.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.105 GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,78 -
Hello!
We are running in another issue: we do not retreive the right number of calls when using the --alleles option of HaplotypeCaller.
We tried variations of this command line:gatk4.2.2 HaplotypeCaller \
--native-pair-hmm-threads 8 \
-R TAIR10_chr_all.fasta \
-I sample.bam \
-O sample_genotypedbygoldenSNPs.vcf \
--alleles goldenSNPsFORMATTED_dict.vcf \
-L goldenSNPsFORMATTED_dict.vcf \
--output-mode EMIT_ALL_CONFIDENT_SITESOur goldenSNPs list has 411,187 SNPs.
With this command line, the results has 399,158 SNPs
We tried adding--force-call-filtered-alleles true
This gives us the same number (399,158)
We tried removing the -L option, this gives us 499,519 SNPs. BUT, no data is retrieved for 13,446 SNPs from the goldenSNPs list (NA).What combination of options should we use to get a file with exactly, and only, the 411,187 SNPs from our goldenSNPs list genotypes in our sample, and all of them with data (if no reads, DP=0 is fine!!).
-
Hi Chloé Girard
Looking at your command line, you used
--output-mode EMIT_ALL_CONFIDENT_SITES
which is actually not what you need for your purpose.
GATK HaplotypeCaller documentation states
Specifies which type of calls we should output
The --output-mode argument is an enumerated type (OutputMode), which can have one of the following values:
EMIT_VARIANTS_ONLY
produces calls only at variant sites
EMIT_ALL_CONFIDENT_SITES
produces calls at variant sites and confident reference sites
EMIT_ALL_ACTIVE_SITES
Produces calls at any region over the activity threshold regardless of confidence. On occasion, this will output HOM_REF records where no call could be confidently made. This does not necessarily output calls for all sites in a region. This argument is intended only for point mutations (SNPs); it will not produce a comprehensive set of indels.So your best options are actually EMIT_VARIANTS_ONLY or EMIT_ALL_ACTIVE_SITES.
It is possible that some of the known sites fall into high complexity regions where reference confidence falls to 0 therefore there is no call produced for those sites regardless of the state of the known variant.
Beware that you need to also use your VCF input file as -L parameter due to the reason that a former HaplotypeCaller parameter
--genotyping-mode GENOTYPE_GIVEN_ALLELES
is no longer available due to design decisions. When only --alleles parameter is supplied with a known set of variants HaplotypeCaller will force call those sites however it will also continue to genotype other regions and will output other variant sites as well. If you wish to get results of only known alleles then you need to supply your known allele VCF file to -L parameter as well.
I hope this helps.
-
Hello!
I was wondering if you managed to get the "--alleles" option running successfully, so you get an output VCF with exactly the variants plus alleles from an input file called. I did not.
I am running GATK 4.2.6.1-1 with a call like this:
gatk HaplotypeCaller -R GRCh38.fna -L scorefiles.vcf.gz --alleles scorefiles.vcf.gz --output-mode EMIT_VARIANTS_ONLY -I wgs.cram -O scorefiles.out.vcf.gz
The result is quite near, but still some entries are missing.
-
Hello Mario!
No we never get the right number of lines in the vcf. We generate a vcf that has more lines, then use AWK to remove the lines that we're not interested in. Problem is: there are sometimes some lines missing.
Otherwise, we use the InGap tool (but it's not maintained) with the -ALE fonction.
https://sourceforge.net/projects/ingap-family/ -
Hi all.
Can you share some of the missing positions (Original entry from the allele file as well) from your output with IGV views as well.
-
After some research into the missing variants, I found that here the REF in the VCF provided via --alleles was not matching the REF in the reference genome. After editing the VCF file accordingly, I obtained all variants of my example case. So matching REF alleles seems a precondition for reporting by GATK. Thanks for caring!
-
Hi Mario Fasold
Thank you for the feedback. This was exactly what I was curious about.
Regards.
Please sign in to leave a comment.
13 comments