Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Haplotypecaller FORMAT:DP is affected by the interval in WGS

0

7 comments

  • Avatar
    Gökalp Çelik

    Hi Chase Mateusiak

    Looking at your results, it looks like only one of those samples gets affected by this change but the other sample is showing a robust result. This could be due to a spanning deletion or a secondary mutation overlapping or neighboring the same interval for DA10. 

    You mentioned that you have looked at our documents on why variant calling might get affected by certain factors in HaplotypeCaller/Mutect2 but let me rephrase. 

    Both tools utilize a local reassembly and realignment to capture variants from sequencing data therefore 2 processes are using whatever reference data and read data provided to them to generate reassembled haplotypes. Certain conditions might affect the outcome of this reassembly such as providing too little reference genome information as intervals. Reassembly uses the read data provided within the interval you provide (and plus some due to some internal calculation for intervals). The more reads provided to the reassembly engine the better the resolution of haplotypes occur especially in hard to sequence regions with high GC/repeat content. Sometimes this results in loss of certain variants that you observe by eye in IGV and this is all due to similarities observed in close regions to the actual variant site that you are expecting. It is not necessarily a bad behavior by this tool since even aligners are not 100% sure about where reads should go when Q score is around 30 and less and region to map is hard to read by short reads. Some aligners prefer certain arrangements due to insert size estimations however unless you exactly know your insert sizes it is only an estimate. 

    Coming back to your question about depths, if you are absolutely sure about using this metric to call certain features in your application the best method would be to use a pileup with very strict parameters to eliminate duplicate, overlapping reads and badly mapped pairs as well as base and mapping qualities below certain thresholds. On the other hand HaplotypeCaller with its latest enhancements on variant calling supports using pileups for variant calls where local reassembly fails to provide better resolution for active regions. You may activate this behavior using the parameter below. Of course we recommend using the latest GATK 4.6.0.0. 

    --use-pdhmm <Boolean>         Partially Determined HMM, an alternative to the regular assembly haplotypes where we
                                  instead construct artificial haplotypes out of the union of the assembly and pileup
                                  alleles.  Default value: false. Possible values: {true, false} 

    I hope this helps.

    Regards. 

    1
    Comment actions Permalink
  • Avatar
    Chase Mateusiak

    I wrote a reply, hit submit, and it isn't appearing. Not sure if there is a wait time, so apologies if this is a repeat in part.

    I'm trying v4.6.0 now. However,

    ```
     gatk HaplotypeCaller \
         -I $INPUT_BAM \
         -R $REFERENCE_GENOME \
         -O $OUTPUT_VCF \
         -ERC GVCF \
         --use-pdhmm true
    ```

    produces the following error:

    ```

    java.lang.IllegalArgumentException: refHaplotype cannot be null

    ```
    which may be expected -- not sure if that is intended to be used with GVCF anyway. Without the GVCF output, the command does execute.

    I am trying `--pileup-detection` now -- it isn't quite clear to me yet what the difference between this, `--use-pdhmm` and the default mode is from the docs, though -- does there happen to be an article about this?

    Back to my region of interest, though: if this were an active region in the sample I am examining, then I could use the --bam-out to see the local re-alignment. Unfortunately, the discrepancy is happening over a region that is entirely REF with MAPQ scores of 60. Do you have any suggestions as to how I can better understand why these reads are being so affected by the interval?

    This is what the reads from 69200 to 69600 (from the left hand side of that image to 200 bp beyond the right hand side) look like

    # Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
    SN    raw total sequences:    85    # excluding supplementary and secondary reads
    SN    filtered sequences:    0
    SN    sequences:    85
    SN    is sorted:    1
    SN    1st fragments:    41
    SN    last fragments:    44
    SN    reads mapped:    85
    SN    reads mapped and paired:    85    # paired-end technology bit set + both mates mapped
    SN    reads unmapped:    0error rate:    7.843137e-04    # mismatches / bases mapped (cigar)
    SN    reads properly paired:    84    # proper-pair bit set
    SN    reads paired:    85    # paired-end technology bit set
    SN    reads duplicated:    1    # PCR or optical duplicate bit set
    SN    reads MQ0:    0    # mapped and MQ=0
    SN    reads QC failed:    0
    SN    non-primary alignments:    0
    SN    supplementary alignments:    0
    SN    total length:    6375    # ignores clipping
    SN    total first fragment length:    3075    # ignores clipping
    SN    total last fragment length:    3300    # ignores clipping
    SN    bases mapped:    6375    # ignores clipping
    SN    bases mapped (cigar):    6375    # more accurate
    SN    bases trimmed:    0
    SN    bases duplicated:    75
    SN    mismatches:    5    # from NM fields
    SN    error rate:    7.843137e-04    # mismatches / bases mapped (cigar)
    SN    average length:    75
    SN    average first fragment length:    75
    SN    average last fragment length:    75
    SN    maximum length:    75
    SN    maximum first fragment length:    75
    SN    maximum last fragment length:    75
    SN    average quality:    34.0
    SN    insert size average:    296.8
    SN    insert size standard deviation:    65.9
    SN    inward oriented pairs:    26
    SN    outward oriented pairs:    0
    SN    pairs with other orientation:    0
    SN    pairs on different chromosomes:    0
    SN    percentage of properly paired reads (%):    98.8

    MAPQ    60    84

    which I read as very high quality alignments (all are MAPQ 60) with only 1 marked as a duplicate. The error rate is less than 0.1%, which is reasonable.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    It is really hard to tell anything from this image. Can you disable View as Pairs option as well as show us the actual variant location that we can assess? You can also share a snippet bam file if you wish us to diagnose the issue. Instructions on how to send a bam snippet can be found in the article below. 

    https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report 

    1
    Comment actions Permalink
  • Avatar
    Chase Mateusiak

    The variant location is the BP in identified by the vertical bars, but yes, the browser isn't helpful here.

    I've attached a subset of the data. On this data, I ran the following commands to produce a result without specifying an interval, and then again with the interval. I am using GATK 4.6.0 now.

    ```

    gatk HaplotypeCaller \
        -I $INPUT \
        -R $REFERENCE_GENOME \
        -O $OUTPUT_VCF \
        -ERC GVCF \
        --sample-ploidy 1

    gatk CombineGVCFs \
        -R $REFERENCE_GENOME \
        -V $INPUT_GVCF_1 \
        -V $INPUT_GVCF_2 \
        -O $OUTPUT_COMBINED_GVCF

    gatk GenotypeGVCFs \
        -R $REFERENCE_GENOME \
        -V $OUTPUT_COMBINED_GVCF \
        -O $OUTPUT_VCF

    ```

    which yields this

    ```
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    DA10_DA10
    CP022325.1    69348    .    T    A    2920.63    .    AC=0;AF=0.5;AN=1;DP=85;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.56;SOR=0.824    GT:AD:DP:GQ:PL    0:4,0:4:99:0,119

    ```

    And then I adjusted the interval over which I am calling in haplotypecaller. The other two commands are the same:

    ```
    gatk HaplotypeCaller \
        -I $INPUT \
        -R $REFERENCE_GENOME \
        -O $OUTPUT_VCF \
        -ERC GVCF \
        --sample-ploidy 1 \
        -L "CP022325.1:69330-69370"
    ```

    and this is the output over the specific base pair in question (which is variant in TDY1754)

    ```
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    DA10_DA10
    CP022325.1    69348    .    T    A    2920.63    .    AC=0;AF=0.5;AN=1;DP=97;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=25.36;SOR=0.824    GT:AD:DP:GQ:PL    0:16,0:16:99:0,585

    ```

    I uploaded the subset data -- two bams and their indicies for samples DA10 and TDY1754. They're in:

    `cmateusiak_20240805.tar.gz`

    The reference genome can be downloaded from fungidb:

    https://fungidb.org/common/downloads/release-68/CneoformansKN99/fasta/data/FungiDB-68_CneoformansKN99_Genome.fasta

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Chase Mateusiak

    This looks like a bug. I will escalate this issue to a github issue. I will post the issue link here once added.

    Regards. 

    You can check the issue from the link below.

    https://github.com/broadinstitute/gatk/issues/8943 

    1
    Comment actions Permalink
  • Avatar
    Chase Mateusiak

    I'm not sure that this is helpful, but from looking at the gVCF files, it seems like there are multiple intervals, all with depth information, which span the base pair in question. When the interval is widened, a wider interval, the first in the gVCF file, is used for the depth in the genotyped cohort VCF. The larger spanning interval has lower depth than the narrow interval, which is used when the interval is narrowed.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    This is probably a bug in the reference confidence emission algorithm when ploidy is set to 1. Our team will look into that. If genotype calls is not important for you but just using DPs then for the time being you may set ploidy to 2 to capture proper DP values no matter what interval you use. 

    1
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk