Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

How to use the --alleles option of HaplotypeCaller ?

0

13 comments

  • Avatar
    Chloé Girard

    Additional note (sorry!):
    The golden_snps.vcf was generated by us, and does not have a header. It looks like this

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  139_genotyped_Indiv_28
    Chr1    346     Chr1_346        C       T       PASS    .       NA      NA      NA
    Chr1    502     Chr1_502        T       C       PASS    .       NA      NA      NA
    Chr1    508     Chr1_508        T       C       PASS    .       NA      NA      NA
    Chr1    657     Chr1_657        C       T       PASS    .       NA      NA      NA
    Chr1    730     Chr1_730        G       A       PASS    .       NA      NA      NA
    Chr1    754     Chr1_754        C       T       PASS    .       NA      NA      NA
    Chr1    755     Chr1_755        T       C       PASS    .       NA      NA      NA
    Chr1    864     Chr1_864        C       T       PASS    .       NA      NA      NA
    Chr1    892     Chr1_892        T       G       PASS    .       NA      NA      NA
    Chr1    1260    Chr1_1260       T       A       PASS    .       NA      NA      NA
    Chr1    2309    Chr1_2309       T       A       PASS    .       NA      NA      NA
    Chr1    2310    Chr1_2310       C       A       PASS    .       NA      NA      NA
    Chr1    3102    Chr1_3102       A       G       PASS    .       NA      NA      NA
    Chr1    4648    Chr1_4648       C       A       PASS    .       NA      NA      NA
    Chr1    6059    Chr1_6059       T       C       PASS    .       NA      NA      NA
    Chr1    6514    Chr1_6514       T       C       PASS    .       NA      NA      NA
    Chr1    6603    Chr1_6603       T       C       PASS    .       NA      NA      NA
    Chr1    8193    Chr1_8193       G       A       PASS    .       NA      NA      NA
    Chr1    8780    Chr1_8780       G       A       PASS    .       NA      NA      NA
    Chr1    9250    Chr1_9250       T       A       PASS    .       NA      NA      NA
    Chr1    9498    Chr1_9498       C       A       PASS    .       NA      NA      NA
    Chr1    9846    Chr1_9846       G       A       PASS    .       NA      NA      NA
    Chr1    10360   Chr1_10360      T       C       PASS    .       NA      NA      NA
    Chr1    11493   Chr1_11493      G       A       PASS    .       NA      NA      NA
    Chr1    12584   Chr1_12584      A       C       PASS    .       NA      NA      NA

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Chloé Girard

    Yes this is the way to force genotype known alleles from a VCF file. GATK is quite strict in conforming HTS standards therefore a VCF file header and an index file is necessary for GATK to read through VCF files. 

    A generic header can be generated using the below line

    ##fileformat=VCFv4.1
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">

    Your file seems to contain a FORMAT field and a sample field however those fields cannot be filled with ambiguous objects therefore it is better you remove any columns after INFO field. A simple header can be added to your VCF file by using

    bcftools reheader 

    tool. However GATK will most likely to ask you a compatible sequence dictionary is needed in the VCF therefore once you add your first header line to your VCF file you may need to use 

    gatk UpdateVCFSequenceDictionary

    tool and a compatible sequence dictionary to that VCF alleles. 

    Finally using

    gatk IndexFeatureFile

    tool will index your VCF file therefore it will be ready to use by HaplotypeCaller --alleles parameter.

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Chloé Girard

    Hello!
    Thank you for your quick and thorough answer. Unfortunately, we do not manage to generate a header.

    We followed the indicated steps:
    1. We added

    ##fileformat=VCFv4.1
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">

    to the header of our file.
    The full header now reads;

    ##fileformat=VCFv4.1
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    139_genotyped_Indiv_28

    Step 2. We removed all columns after the INFO field (using R, and saving the file as a tabulated file). The "goldenSNPsimple.vcf" file now reads:

    ##fileformat=VCFv4.1
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">
    #CHROM    POS    ID    REF    ALT    FILTER    INFO
    Chr1    346    Chr1_346    C    T    PASS    .
    Chr1    502    Chr1_502    T    C    PASS    .
    Chr1    508    Chr1_508    T    C    PASS    .
    Chr1    657    Chr1_657    C    T    PASS    .
    Chr1    730    Chr1_730    G    A    PASS    .
    Chr1    754    Chr1_754    C    T    PASS    .
    Chr1    755    Chr1_755    T    C    PASS    .
    Chr1    864    Chr1_864    C    T    PASS    .

    Step 3: Running bcftools rehaeader command. This is when things went wrong.

    bcftools reheader -h goldenSNPs_simple.vcf -o goldenSNPs_simple_header.vcf

    The error that we get is as follow:

    bcftools: bgzf.c:811: bgzf_read: Assertion `fp->is_write == 0' failed.
    /var/spool/pbs/mom_priv/jobs/506945.pbsserver.SC: line 19: 3693648 Aborted                 bcftools reheader -h goldenSNPs_simple.vcf -o goldenSNPs_simple_header.vcf

    Looking for solutions:

    1. We tried searching for a way around it with picard FixVcfHeader

    picard.jar FixVcfHeader \
    -I=goldenSNPs_simple.vcf \
    -O=goldenSNPs_simple_header.vcf

    and we get the following error

    Your input file has a malformed header: there are not enough columns present in the header line: #CHROM       POS     ID      REF     ALT     FILTER  INFO

    2. We tried adding lines to the VCF as a header manually, creating the "goldenSNPsimple_manual.vcf" file. The file now reads :

    ##fileformat=VCFv4.1
    ##source=goldenSNPs_script
    ##contig=<ID=1,length=30427671>
    ##contig=<ID=2,length=19698289>
    ##contig=<ID=3,length=23459830>
    ##contig=<ID=4,length=18585056>
    ##contig=<ID=5,length=26975502>
    ##contig=<ID=mitochondria,length=366924>
    ##contig=<ID=chloroplast,length=154478>
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    #CHROM  POS     ID      REF     ALT     FILTER  INFO
    Chr1    346     Chr1_346        C       T       PASS    .
    Chr1    502     Chr1_502        T       C       PASS    .
    Chr1    508     Chr1_508        T       C       PASS    .
    Chr1    657     Chr1_657        C       T       PASS    .
    Chr1    730     Chr1_730        G       A       PASS    .
    Chr1    754     Chr1_754        C       T       PASS    .
    Chr1    755     Chr1_755        T       C       PASS    .
    Chr1    864     Chr1_864        C       T       PASS    .
    Chr1    892     Chr1_892        T       G       PASS    .
    Chr1    1260    Chr1_1260       T       A       PASS    .

    And then we tried running the gatk UpdateVCFSequenceDictionary command

    gatk UpdateVCFSequenceDictionary \
    #      -V goldenSNPsimple_manual.vcf \
    #      --source-dictionary TAIR10_chr_all.dict \
    #      --output goldenSNPsimple_manual_dict.vcf

    And then we get the same error

    Your input file has a malformed header: there are not enough columns present in the header line: #CHROM     POS     ID      REF     ALT     FILTER  INFO

     

    Why are there not enough columns ???

    If you can help us, we would greatly appreciate it.

    Thank you!!!
    Chloé

    0
    Comment actions Permalink
  • Avatar
    Chloé Girard

    Update: We solved the first problem by "inventing" columns :)

    ##fileformat=VCFv4.1
    ##source=goldenSNPs_script
    ##contig=<ID=1,length=30427671>
    ##contig=<ID=2,length=19698289>
    ##contig=<ID=3,length=23459830>
    ##contig=<ID=4,length=18585056>
    ##contig=<ID=5,length=26975502>
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    SAMPLE
    Chr1    346    Chr1_346    C    T    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    502    Chr1_502    T    C    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    508    Chr1_508    T    C    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    657    Chr1_657    C    T    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    730    Chr1_730    G    A    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    754    Chr1_754    C    T    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    755    Chr1_755    T    C    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    864    Chr1_864    C    T    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28
    Chr1    892    Chr1_892    T    G    1    PASS    DP=1    GT:GQ:DP:HQ    0/1:50:2:28

    We'll keep you updated on the next steps.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Chloé Girard

    VCF format has immutable columns to be valid for processing. Your file was missing the QUAL field. Anything east of INFO is not necessary for a VCF to be valid (In that case it becomes a sites only VCF). 

    Yet again your solution is valid and should not cause you any issues if all your known variants are compatible with your reference genome, left aligned and simplified (meaning no MNPs or Complex Substitutions).  

    Regards.

    0
    Comment actions Permalink
  • Avatar
    Chloé Girard

    Hello ! It worked!!

    Let's recap for future users:

    The final vcf was created manually (in R and in a text editor) to look like this (notice that we changed "Chr1" to "1" to match the TAIR10 reference) :

    ##fileformat=VCFv4.1
    ##source=goldenSNPs_script
    ##contig=<ID=1,length=30427671>
    ##contig=<ID=2,length=19698289>
    ##contig=<ID=3,length=23459830>
    ##contig=<ID=4,length=18585056>
    ##contig=<ID=5,length=26975502>
    ##FILTER=<ID=PASS,Description="Variants Passing All Filters">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
    1    346    .    C    T    1    PASS    .
    1    502    .    T    C    1    PASS    .
    1    508    .    T    C    1    PASS    .
    1    657    .    C    T    1    PASS    .
    1    730    .    G    A    1    PASS    .
    1    754    .    C    T    1    PASS    .
    1    755    .    T    C    1    PASS    .
    1    864    .    C    T    1    PASS    .
    1    892    .    T    G    1    PASS    .

    Then we just ran the UpdateVCFSequenceDictionary function

    gatk2 UpdateVCFSequenceDictionary \
       -V goldenSNPsFORMATTED.vcf \
       --source-dictionary TAIR10_chr_all.dict \
         --replace true \
       --output goldenSNPsFORMATTED_dict.vcf

    This created both a VCF and an index.

    We then ran HaplotypeCaller with the dict.vcf

    /opt/singularity/calls_gatk/gatk422 HaplotypeCaller  \
       --native-pair-hmm-threads 8 \
     -R TAIR10_chr_all.fasta \
     -I sample.bam \
     -O sample_genotypedbygoldenSNPs.vcf \
     --alleles goldenSNPsFORMATTED_dict.vcf \
     -L goldenSNPsFORMATTED_dict.vcf \
       --output-mode EMIT_ALL_CONFIDENT_SITES


    and we have our final file!!
    EDIT: We do not have the right number of lines. See next post for details.

    ##fileformat=VCFv4.2
    ##FILTER=<ID=LowQual,Description="Low quality">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
    ##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --output-mode EMIT_ALL_CONFIDENT_SITES --native-pair-hmm-threads 8 --alleles /store/EQUIPES/MRP/Partage/Arabidopsis_commun/ALL_SCRIPTS_HERE/Git/polyrec/data/data_test/data_polyrec/parental_golden_snps_139_genotyped_Indiv_28_merged_FORMATTED_IDpoint_noChr_dict.vcf --output /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/CO_mapping/Tests/111355_AA_genotypedbygoldenSNPs.vcf --intervals parental_golden_snps_139_genotyped_Indiv_28_merged_FORMATTED_IDpoint_noChr_dict.vcf --input /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/CO_mapping/PROCESSEDDATA/Individus_XXVIII_28/111355_AA/111355_AAmarkdup.bam --reference /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/REF_genomes/TAIR10_chr_all.fasta --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotype-assignment-method USE_PLS_TO_ASSIGN --contamination-fraction-to-filter 0.0 --all-site-pls false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --dragen-mode false --apply-bqd false --apply-frd false --disable-spanning-event-genotyping false --transform-dragen-mapping-quality false --mapping-quality-threshold-for-genotyping 20 --max-effective-depth-adjustment-for-frd 0 --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 0 --force-call-filtered-alleles false --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --max-reads-per-alignment-start 50 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.2.2.0",Date="November 14, 2023 9:34:23 AM GMT">
    ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
    ##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
    ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
    ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
    ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
    ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
    ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
    ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
    ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
    ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
    ##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
    ##contig=<ID=1,length=30427671>
    ##contig=<ID=2,length=19698289>
    ##contig=<ID=3,length=23459830>
    ##contig=<ID=4,length=18585056>
    ##contig=<ID=5,length=26975502>
    ##contig=<ID=mitochondria,length=366924>
    ##contig=<ID=chloroplast,length=154478>
    ##source=HaplotypeCaller
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    111355_AA
    1    346    .    C    T    33.36    .    AC=0;AF=0.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.693    GT:AD:DP:GQ:PL    0/0:2,0:2:6:0,6,78
    1    502    .    T    C    48.77    .    AC=0;AF=0.00;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223    GT:AD:DP:GQ:PL    0/0:7,0:7:21:0,21,243
    1    508    .    T    C    45.74    .    AC=0;AF=0.00;AN=2;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.307    GT:AD:DP:GQ:PL    0/0:6,0:6:18:0,18,205
    1    657    .    C    T    45.74    .    AC=0;AF=0.00;AN=2;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.693    GT:AD:DP:GQ:PL    0/0:6,0:6:18:0,18,233
    1    730    .    G    A    42.62    .    AC=0;AF=0.00;AN=2;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.446    GT:AD:DP:GQ:PL    0/0:5,0:5:15:0,15,165
    1    754    .    C    T    39.73    .    AC=0;AF=0.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223    GT:AD:DP:GQ:PL    0/0:4,0:4:12:0,12,155
    1    755    .    T    C    39.73    .    AC=0;AF=0.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223    GT:AD:DP:GQ:PL    0/0:4,0:4:12:0,12,155
    1    864    .    C    T    33.36    .    AC=0;AF=0.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.105    GT:AD:DP:GQ:PL    0/0:2,0:2:6:0,6,78
    0
    Comment actions Permalink
  • Avatar
    Chloé Girard

    Hello!

    We are running in another issue: we do not retreive the right number of calls when using the --alleles option of HaplotypeCaller.


    We tried variations of this command line:

    gatk4.2.2 HaplotypeCaller  \
       --native-pair-hmm-threads 8 \
     -R TAIR10_chr_all.fasta \
     -I sample.bam \
     -O sample_genotypedbygoldenSNPs.vcf \
     --alleles goldenSNPsFORMATTED_dict.vcf \
     -L goldenSNPsFORMATTED_dict.vcf \
    --output-mode EMIT_ALL_CONFIDENT_SITES

    Our goldenSNPs list has 411,187 SNPs.
    With this command line, the results has 399,158 SNPs


    We tried adding

    --force-call-filtered-alleles true 

    This gives us the same number (399,158)

    We tried removing the -L option, this gives us 499,519 SNPs. BUT, no data is retrieved for 13,446 SNPs from the goldenSNPs list (NA).

    What combination of options should we use to get a file with exactly, and only, the 411,187 SNPs from our goldenSNPs list genotypes in our sample, and all of them with data (if no reads, DP=0 is fine!!).

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Chloé Girard

    Looking at your command line, you used

    --output-mode EMIT_ALL_CONFIDENT_SITES

    which is actually not what you need for your purpose.

    GATK HaplotypeCaller documentation states

    Specifies which type of calls we should output
    The --output-mode argument is an enumerated type (OutputMode), which can have one of the following values:

    EMIT_VARIANTS_ONLY
    produces calls only at variant sites
    EMIT_ALL_CONFIDENT_SITES
    produces calls at variant sites and confident reference sites
    EMIT_ALL_ACTIVE_SITES
    Produces calls at any region over the activity threshold regardless of confidence. On occasion, this will output HOM_REF records where no call could be confidently made. This does not necessarily output calls for all sites in a region. This argument is intended only for point mutations (SNPs); it will not produce a comprehensive set of indels.

    So your best options are actually EMIT_VARIANTS_ONLY or EMIT_ALL_ACTIVE_SITES.

    It is possible that some of the known sites fall into high complexity regions where reference confidence falls to 0 therefore there is no call produced for those sites regardless of the state of the known variant. 

    Beware that you need to also use your VCF input file as -L parameter due to the reason that a former HaplotypeCaller parameter

    --genotyping-mode GENOTYPE_GIVEN_ALLELES

    is no longer available due to design decisions. When only --alleles parameter is supplied with a known set of variants HaplotypeCaller will force call those sites however it will also continue to genotype other regions and will output other variant sites as well. If you wish to get results of only known alleles then you need to supply your known allele VCF file to -L parameter as well. 

    I hope this helps.

    0
    Comment actions Permalink
  • Avatar
    Mario Fasold

    Hello!

    I was wondering if you managed to get the "--alleles" option running successfully, so you get an output VCF with exactly the variants plus alleles from an input file called. I did not.

    I am running GATK 4.2.6.1-1 with a call like this:

    gatk  HaplotypeCaller -R GRCh38.fna -L scorefiles.vcf.gz --alleles scorefiles.vcf.gz --output-mode EMIT_VARIANTS_ONLY -I wgs.cram -O scorefiles.out.vcf.gz

     

    The result is quite near, but still some entries are missing.

    0
    Comment actions Permalink
  • Avatar
    Chloé Girard

    Hello Mario!

    No we never get the right number of lines in the vcf. We generate a vcf that has more lines, then use AWK to remove the lines that we're not interested in. Problem is: there are sometimes some lines missing.

    Otherwise, we use the InGap tool (but it's not maintained) with the -ALE fonction.
    https://sourceforge.net/projects/ingap-family/

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi all. 

    Can you share some of the missing positions (Original entry from the allele file as well) from your output with IGV views as well. 

    0
    Comment actions Permalink
  • Avatar
    Mario Fasold

    After some research into the missing variants, I found that here the REF in the VCF provided via --alleles was not matching the REF in the reference genome. After editing the VCF file accordingly, I obtained all variants of my example case. So matching REF alleles seems a precondition for reporting by GATK. Thanks for caring!

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Mario Fasold

    Thank you for the feedback. This was exactly what I was curious about. 

    Regards. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk