Unable to replicate the spanning deletions documentation
Hello,
I am trying to understand GATKs behavior regarding spanning deletions. I went over the documentation page, spanning or overlapping deletions (* allele) and wanted to replicate the behavior. I was unable to do so. Following is what I did.
GATK version used - 4.1.7.0
I simulated reads to observe the spanning deletions. The reads were split into four samples, in the same manner as the documentation. Each sample had 50 identical read pairs with the following alleles at a certain position
- sample 1 - 25 read pairs with reference and 25 read pairs with an ALT
- sample 2 - 50 read pairs with deletion
- sample 3 - 25 read pairs with deletion and 25 read pairs with an ALT
- sample 4 - 25 read pairs with deletion and 25 read pairs with reference
The deletion started 2bp upstream of the concerned position. Here is the image from the article modified to explain this case.
- These reads were mapped using BWA using:
bwa mem -C -M -t 10 /path/to/ref/hg19 R1.fastq R2.fastq > align.sam
- The sam file was split into multiple files, one per sample. For each sample sam file GATK HaplotypeCaller was run. The bamout option was used to analyze the realigned bam files. They look as expected except that instead of 25/25 reads for each allele for each sample there were 23/27 reads in all the samples. I am not sure why.
gatk --java-options "-Xmx2g" HaplotypeCaller -R /path/to/ref -I /path/to/sam --emit-ref-confidence BP_RESOLUTION -L /path/to/bed_file -O /path/to/gvcf_file -bamout /path/to/sample.bam
- The four gvcf files were then merged to generate the vcf file using:
gatk GenomicsDBImport --arguments_file /file/containing/variants/arguments --genomicsdb-workspace-path /path/to/gendb/ --intervals /path/to/bed_file
gatk GenotypeGVCFs -R /path/to/ref -V gendb://gendb/ -L /path/to/bed_file --all-sites --include-non-variant-sites -O /path/to/final_vcf_file
For the position of interest, the expected outcome according to the article linked above was:
- sample 1 - REF/ALT
- sample 2 - */*
- sample 3 - ALT/*
- sample 4 - REF/*
However the observed calls were:
- sample 1 - REF/ALT
- sample 2 - ./. with GQ 0
- sample 3 - ALT/*
- sample 4 - REF/REF with GQ 0
The GT for samples 2 and 4 was not as expected. Following are the other fields for all the samples (NOTE: The REF and ALT were changed to match my actual case):
- REF - C
- ALT - A,*
- FORMAT - GT:AD:DP:GQ:PL
- sample 1 - 0/1:27,23,0:50:99:386,0,517,463,582,1045
- sample 2 - ./.:0,50,50:50:.:0,0,0,0,0,0
- sample 3 - 1/2:0,23,27:50:99:1810,1134,1064,729,0,884
- sample 4 - 0/0:27,23,23:50:0:0,0,44,0,44,44
Could someone please answer the following questions that come to my mind:
- Why did the DPs changed from 25/25 to 23/27 when the HaploytpeCaller realigned the reads
- More importantly, for sample 2 and sample 4, why did the GT not match the GT as shown in the article.
Please do let me know if any file need to be shared, and where I should share them.
Thank you,
Saurabh Parikh
-
Hi,
Our first recommendation is always to investigate the bamout for these regions. This will help determine how HC does realignment.
Secondly, please provide synthetic bam so we can investigate this. Please follow instructions here for info on how to share data with us: https://gatk.zendesk.com/hc/en-us/articles/360035889671
-
Hello Bhanu,
I looked at the bamout files for these regions, and they seem to match what the final VCF file shows with respect to the number of reads. I do not know why the ratio of the number of reads to each allele changes though. This ratio is different from the original bam file. HC seems to drop some of the reads, that too in a disproportionate manner. Why does HC drop some of the reads even though all of them are identical?
I have uploaded the files to the server with the name parikh_spanning_deletions_data.zip
I have included an other_files folder in case you want to take a look at those. Following is the directory structure.
- ./bam_per_RG/ - Contains the synthetic bam and their index files for each sample
- ./commands.txt - The commands I used to generate the final files
- ./other_files/gatk.bed - Used to generate output for the interval of interest
- ./other_files/gvcf_per_RG/ - The gvcf files and the bamout files generated per sample
- ./other_files/logs - Logs of HC, GenomicsDBImport, and GenotypeGVCFs
- ./other_files/samples.vcf - The final vcf file generated
Thanks,
Saurabh Parikh
-
Hi,
Take a look at this doc for explanation on how HaplotypeCaller reassembles reads: https://gatk.broadinstitute.org/hc/en-us/articles/360036227612-Local-re-assembly-and-haplotype-determination-HaplotypeCaller-and-Mutect2-
Does the bamout also explain the observed genotype? Can you please post IGV screenshots of the regions where you dont see the expected GT.
-
Hello Bhanu,
Okay, thanks for sharing the link. I will check it out.
The bamout for all samples are as expected. According to my understanding the spanning deletion should still be present and be called according to the documentation shared above. Just to note, the genotype is correctly called for all the samples at the site of the deletion, however it is miscalled where the spanning deletion is present. For the file that I shared above, the calls are:
- attribute: Position of deletion ------- Position of spanning deletion
- REF: TGCC ------- C
- ALT: T ------- A,*
- FORMAT: GT:AD:DP:GQ:PL ------- GT:AD:DP:GQ:P
- sample 1: 0/0:50,0:50:99:0,120,1800 ------- 0/1:23,27,0:50:99:694,0,595,763,676,1439
- sample 2: 1/1:0,50:50:99:2250,151,0 ------- ./.:0,50,50:50:.:0,0,0,0,0,0
- sample 3: 0/1:27,23:50:99:884,0,1064 ------- 1/2:0,27,23:50:99:1729,966,884,828,0,1064
- sample 4: 0/1:27,23:50:99:884,0,1064 ------- 0/0:27,23,23:50:0:0,0,44,0,44,44
Here are the images from IGV for the bamout files from sample 2 and sample 4 respectively.
Sample 2:
Sample 4:
Here are the images for the remaining samples that show the spanning deletion correctly.
Sample 1:
Sample 3:
Thanks,
Saurabh Parikh
-
Hi,
Thank you so much for sharing this detailed analysis with us. We want to duplicate this issue on our end. Which hg19 reference are you using? Can you please share the md5 and contigs from the reference. Also please provide seq dict.
-
Hello Bhanu,
Thanks!
The bam files were created using the UCSC hg19 fasta. It contained chr1-22, chrX, chrY, chrM, and three other custom sequences. My internet speed doesn't allow me to transfer the entire fasta file. I tried multiple times, it kept failing. The names of the files were parikh_hg19.zip and parikh_hg19_reupload.zip, in case you want to delete them from the server.
Do note, I was able to replicate the issue using the standard fasta file found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/.
Should I recreate a bam file with some other reference? Do let me know.
Sincerely,
Saurabh Parikh
-
That should be fine. I have created a issue ticket to investigate this: https://github.com/broadinstitute/gatk/issues/6588. You can follow its progress in the issue ticket.
-
Okay, thanks Bhanu. I'll follow the ticket.
Please sign in to leave a comment.
8 comments