Loss of data after HaplotypeCaller
Dear GATK,
I am having a few problems with SNP calling.
I have an ancient human exome dataset which I mapped with bwa aln to hg38. For each chromosome I have about 200K to 500K reads mapping to my reference genome after duplicate removal (pipeline: bwa aln, bwa samse, Picard CleanSam, Picard SortSam, samtools view -b -F 4 -q 25, awesam.sh collapser; a rough sketch of the commands is below).
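For reference, the preprocessing steps were roughly of this form (just a sketch of what I did; file names are placeholders, and the awesam.sh duplicate-collapsing step is run afterwards as distributed):
bwa aln Homo_sapiens_assembly38.fasta myreads.fastq > myfile.sai
bwa samse Homo_sapiens_assembly38.fasta myfile.sai myreads.fastq > myfile.sam
java -jar picard.jar CleanSam I=myfile.sam O=myfile_clean.sam
java -jar picard.jar SortSam I=myfile_clean.sam O=myfile_sorted.bam SORT_ORDER=coordinate
samtools view -b -F 4 -q 25 myfile_sorted.bam > myfile_filtered.bam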
I am trying to follow the Best Practices for single-sample variant discovery (so I follow the VCF rather than the GVCF pipeline) and to filter my dataset, but the bamout file indicates that I am losing a large amount of my data (for example, for chromosome 1 I had 500K reads, and the bamout after HC contained only 4K reads). My samtools flagstat output shows 6,811,953 QC-passed reads, and ValidateSamFile reports no errors. I am aware that HC reassembles the alignment, but my BAM file before and after HC looks completely different, almost as if bwa had mapped a different dataset.
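The checks I ran were along these lines (file names are placeholders; the ValidateSamFile arguments may need adjusting for your GATK version):
samtools flagstat myfile.bam
java -jar gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar ValidateSamFile -I myfile.bam --MODE SUMMARY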
I am using GATK 4.0.12.0 locally and below are my commands:
HaplotypeCaller
java -jar gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I myfile.bam -O myfile_HC.vcf -bamout myfile_bamout.bam
VariantFiltration
java -jar gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar VariantFiltration -R Homo_sapiens_assembly38.fasta -V myfile_HC.vcf --filter-expression "DP<5" --filter-name "DP" -O myfile_HC_VF.vcf
I have also tried HC with the -L parameter to test it chromosome by chromosome, but I still got the same reduced read count. Also, the SNPs I get after VariantFiltration are not supported by all the reads, so I'm wondering if there is a parameter for the coverage of the SNP allele itself rather than the overall depth of coverage. Because I am working with ancient DNA, I need to show that a SNP is present in all reads and is not a sequencing artefact.
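For example, the per-chromosome test looked like this (chr1 shown; output names are just placeholders):
java -jar gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I myfile.bam -L chr1 -O myfile_chr1_HC.vcf -bamout myfile_chr1_bamout.bam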
Any advice is appreciated.
Thank you for your continuous support!
-
The bamout file is not your final BAM file to keep. It only contains the reassembled reads and haplotypes from the regions HC actually examined. It is only for debugging/display purposes, not for keeping as a final BAM.
-
Thank you for your input, SkyWarrior.
mbesskd5, in addition to that, also take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
-
Thank you both for your prompt response, I really appreciate it!
I did compare the bamout with my initial BAM, and the SNPs appear at the same positions, although they are not covered by all reads in the bamout, whereas they are covered by all reads in my initial file. I'm guessing that has to do with the reassembly, as suggested here? https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
Also I read in the link that:
"Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think."
What is the unfiltered depth? Am I correct in understanding that although I may see DP=22 in my VCF, the actual SNP coverage could be only half of those reads because low-quality bases are ignored? In that case, doesn't this mean the SNP is not actually a SNP, but an error or, in my case, DNA damage? Is there a way to filter on depth and only keep sites where 100% of the reads support the SNP?
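To make sure I am reading the fields correctly, here is a made-up genotype entry for illustration (the numbers are hypothetical):
GT:AD:DP    0/1:12,10:22
If I understand correctly, DP=22 is the unfiltered depth, while AD=12,10 means that 12 of the reads the caller used support the reference allele and 10 support the alternate, i.e. an alternate allele fraction of 10/22 ≈ 0.45 rather than 100%. Is AD the field I should be filtering on?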
Thanks a lot! :)
-
You may trim out bad reads, but that is not the ultimate solution. All variant callers apply base-quality and mapping-quality filters that can be adjusted by the user; however, users are often not aware of what is going on inside the whole BAM file, so this is a risky thing to do. Canonically, BQ and MQ above 20 are usually safe to include in your variant calling. Unless you work with very low-coverage samples, there is really nothing to worry about in the result. However, if you are working with low-coverage samples all the time, then you may need to readjust your parameters and compare your call strength with each parameter set until you come up with an optimal solution.
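If you do want to experiment, the thresholds can be raised on the HaplotypeCaller command line, something along these lines (20 is just an example value; --min-base-quality-score is a HaplotypeCaller argument and --minimum-mapping-quality belongs to the default mapping-quality read filter, so check the tool docs for your GATK version):
java -jar gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I myfile.bam -O myfile_HC.vcf --min-base-quality-score 20 --minimum-mapping-quality 20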
-
Hi SkyWarrior
You are a star, thank you. I trim bad reads as the first step of my pipeline using AdapterRemoval, and I usually keep quality thresholds at 25-30, never below 20; I follow this rule throughout my pipeline precisely because I'm working with low-coverage data (ancient DNA). And for every dataset I've had, I used GATK and got SNP calls that are not supported by all reads even though they are above Q25.
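For context, my trimming step looks roughly like this (single-end example; the file names and the exact quality cut-off shown here are placeholders):
AdapterRemoval --file1 myreads.fastq --trimqualities --minquality 25 --trimns --output1 myreads_trimmed.fastq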