Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Unable to replicate the spanning deletions documentation

0

8 comments

  • Avatar
    Bhanu Gandham

    Hi,

     

    Our first recommendation is always to investigate the bamout for these regions. This will help determine how HC does realignment.

    Secondly, please provide synthetic bam so we can investigate this. Please follow instructions here for info on how to share data with us: https://gatk.zendesk.com/hc/en-us/articles/360035889671

    0
    Comment actions Permalink
  • Avatar
    Saurabh Parikh

    Hello Bhanu,

    I looked at the bamout files for these regions, and they seem to match what the final VCF file shows with respect to the number of reads. I do not know why the ratio of the number of reads to each allele changes though. This ratio is different from the original bam file. HC seems to drop some of the reads, that too in a disproportionate manner. Why does HC drop some of the reads even though all of them are identical?

    I have uploaded the files to the server with the name parikh_spanning_deletions_data.zip

    I have included an other_files folder in case you want to take a look at those. Following is the directory structure.

    • ./bam_per_RG/ - Contains the synthetic bam and their index files for each sample
    • ./commands.txt - The commands I used to generate the final files
    • ./other_files/gatk.bed - Used to generate output for the interval of interest
    • ./other_files/gvcf_per_RG/ - The gvcf files and the bamout files generated per sample
    • ./other_files/logs - Logs of HC, GenomicsDBImport, and GenotypeGVCFs
    • ./other_files/samples.vcf - The final vcf file generated

     

    Thanks,

    Saurabh Parikh

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi,

     

    Take a look at this doc for explanation on how HaplotypeCaller reassembles reads: https://gatk.broadinstitute.org/hc/en-us/articles/360036227612-Local-re-assembly-and-haplotype-determination-HaplotypeCaller-and-Mutect2-

     

    Does the bamout also explain the observed genotype? Can you please post IGV screenshots of the regions where you dont see the expected GT.

    0
    Comment actions Permalink
  • Avatar
    Saurabh Parikh

    Hello Bhanu,

    Okay, thanks for sharing the link. I will check it out.

    The bamout for all samples are as expected. According to my understanding the spanning deletion should still be present and be called according to the documentation shared above. Just to note, the genotype is correctly called for all the samples at the site of the deletion, however it is miscalled where the spanning deletion is present. For the file that I shared above, the calls are:

    • attribute: Position of deletion ------- Position of spanning deletion
    • REF: TGCC ------- C
    • ALT: T ------- A,*
    • FORMAT: GT:AD:DP:GQ:PL ------- GT:AD:DP:GQ:P
    • sample 1: 0/0:50,0:50:99:0,120,1800 ------- 0/1:23,27,0:50:99:694,0,595,763,676,1439
    • sample 2: 1/1:0,50:50:99:2250,151,0 ------- ./.:0,50,50:50:.:0,0,0,0,0,0
    • sample 3: 0/1:27,23:50:99:884,0,1064 ------- 1/2:0,27,23:50:99:1729,966,884,828,0,1064
    • sample 4: 0/1:27,23:50:99:884,0,1064 ------- 0/0:27,23,23:50:0:0,0,44,0,44,44

    Here are the images from IGV for the bamout files from sample 2 and sample 4 respectively.

    Sample 2:

    Sample 4:

     

    Here are the images for the remaining samples that show the spanning deletion correctly.

    Sample 1:

    Sample 3:

     

    Thanks,

    Saurabh Parikh

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi,

     

    Thank you so much for sharing this detailed analysis with us. We want to duplicate this issue on our end. Which hg19 reference are you using? Can you please share the md5 and contigs from the reference. Also please provide seq dict. 

    0
    Comment actions Permalink
  • Avatar
    Saurabh Parikh

    Hello Bhanu,

    Thanks!

    The bam files were created using the UCSC hg19 fasta. It contained chr1-22, chrX, chrY, chrM, and three other custom sequences. My internet speed doesn't allow me to transfer the entire fasta file. I tried multiple times, it kept failing. The names of the files were parikh_hg19.zip and parikh_hg19_reupload.zip, in case you want to delete them from the server.

    Do note, I was able to replicate the issue using the standard fasta file found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

    Should I recreate a bam file with some other reference? Do let me know.

     

    Sincerely,

    Saurabh Parikh

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi Saurabh Parikh

     

    That should be fine. I have created a issue ticket to investigate this: https://github.com/broadinstitute/gatk/issues/6588. You can follow its progress in the issue ticket.

    0
    Comment actions Permalink
  • Avatar
    Saurabh Parikh

    Okay, thanks Bhanu. I'll follow the ticket.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk