I am having a few problems with SNP calling.
I have an ancient human exome dataset which I mapped with bwa aln to the hg38. For each chromosome I have about 200K to 500K reads mapping to my reference genome after duplicate removal (bwa aln, bwa samse, Picard CleanSam, Picard SortSam, Samtools -b -F4 -q25, awesam.sh collapser).
I am trying to follow the best practices for single sample variant discovery (therefore I follow the vcf and not gvcf pipeline) and filtering my dataset but my bamout file indicates that I am losing a great amount of my data (for example for chromosome 1 I had 500K reads, and the bamout file after HC gave me 4K reads). My samtools flagstat output shows that I have 6,811,953 QC passed reads and ValidateSamFile shows no errors. I am aware that HC reassembles the alignment but my bam file before and after HC looks completely different, almost as if bwa mapped a different dataset.
I am using GATK 188.8.131.52 locally and below are my commands:
java -jar gatk-184.108.40.206/gatk-package-220.127.116.11-local.jar HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I myfile.bam -O myfile_HC.vcf -bamout myfile_bamout.bam
java -jar gatk-18.104.22.168/gatk-package-22.214.171.124-local.jar VariantFiltration -R Homo_sapiens_assembly38.fasta -V myfile_HC.vcf --filter-expression "DP<5" --filter-name "DP" -O myfile_HC_VF.vcf
I have also tried HC with the -L parameter to test it by chromosome, but I still got the same amount of reduced reads. Also the SNPs I get after VF are not present in all the reads, so I'm wondering if there is any parameter for the SNP coverage rather than the depth of coverage. Because I am working with ancient DNA I need to show that the SNP is present in all reads and not a sequencing artefact.
Any advice is appreciated.
Thank you for your continuous support!
Please sign in to leave a comment.