Estimate genotype proportions from filtered variants
Hello! I am a new user to your pipeline! And so far I love how intuitive it is and easy to implement. However I am having a hard time estimating certain proportions. Here's what I am doing:
I have 13 DNA sequenced samples from Plasmodium falciparum (haplotype, ploidy = 1), where I sequence a gene that is known to confer resistance to a treatment. In each sample, I am expecting multiple genotypes from this strain, but I am interested in estimating the proportion of each genotype. Here's what I've done so far:
1. BWA to map raw reads to my reference gene, followed by index of my reference files (not included in this message):
bwa mem 3d7_dhfr.fasta S13_R1.fastq S13_R2.fastq > S13_dhfr.sam
2. Mark duplicates with Picard:
picard MarkDuplicates -I S13_dhfr.bam -O S13_dhfr_markeddup.bam -M S13_dup_dhfr_metrics.txt
3. Adding read group headers to my sam file followed by picard ValidateSamFile (the latter not included here):
picard AddOrReplaceReadGroups -I S13_dhfr_markeddup.bam -O S13_dhfr_RG.bam -SM 13 -LB lib1 -PL ILLUMINA -PU unit13
5. HaplotypeCaller with these parameters:
gatk HaplotypeCaller -I S13_dhfr.bam -O S13_dhfr.vcf -R 3d7_dhfr.fasta -ploidy 1 -bamout --standard-min-confidence-threshld-for-calling 30 -ERC GVCF
6. Merge my .vcf files into a single vcf file using gatk (to join 13 samples where I ran the previous commands in each one of them):
gatk CombineGVCFs -R 3d7_dhfr.fasta -V S1_dhfr.vcf -V S2_dhfr.vcf ... -V S13_dhfr.vcf -O dhfr_merged.vcf
7. Estimating genotypes from the merged vcfs
gatk GenotypeGVCFs -V dhfr_merged.vcf -O genotypes_dhfr.vcf -R 3d7_dhfr.fasta
8. Hard filtering on variants by (i) a minimum of 50 reads per nucleotide position, (ii) 1% of depth to support that particular SNP or indel, and (iii) a minimum phred score quality of 30
gatk VariantFiltration -V genotypes_dhfr.vcf -O dhfr_filtered.vcf -filter "QUAL < 30" --filter-name "QUAL30" -filter "AF < 0.01" --filter-name AF0.01 -filter "DP < 50" --filter-name "DP50"
After this is where I'm lost even after watching the lecture videos you guys kindly provided in April of 2020. I tried doing genotype posterior, but then I realized that's only for diploid organisms to calculate homozygous or heterozygous alleles (which is not my case). My questions are:
(i) Which gatk command should I use to estimate the genotype proportions of SNPs found within an Illumina read size (SNPs less than 200 bp from each other), i.e.:
SNP1 - SNP2 proportion
WT1 - SNP2 proportion
SNP1 - WT2 propotion
WT1 - WT2 proportion
(ii) How does gatk handle genotype calling when the SNPs are not supported by the same short read? For instance, I noticed I have an indel 700 bp away from SNP2. Does gatk call this genotype separately independent from the previous 2 SNPs I used as an example? Or does it perform haplotype reconstruction on short reads?
I am sorry if this is a discussion that has been published in the past, but as fas as I can tell I couldn't find anything similar.
Thanks for your valuable time and help!
-
Is the -bamout option from the VariantCaller doing that maybe? I realized what I am asking for is something doable from the HaplotypeCaller. Would it work if I obtain a bamout file from my filtered variantes to calculate this proportion?
-
I am a bit confused about the information you are looking for here. What do you mean by this?
SNP1 - SNP2 proportion
WT1 - SNP2 proportion
SNP1 - WT2 propotion
WT1 - WT2 proportion
Could you explain further the meaning of your questions?
The -bamout option is a great tool for troubleshooting individual variant sites to see what the reads look like that support the variant. There will only be regions in the bamout that have variant sites.
Best,
Genevieve
-
Hi Genevieve,
Thanks for your response. Of course, allow me to elaborate!
In my P. falciparum gene of interest there are two mutants that are known to confer resistance to a treatment. These two mutants are characterized by a single SNP in two different nucleotide positions. Fortunately for me, these two positions differ by less than 200 bp from each other, and because I used an Illumina instrument (2x250) I know that theoretically I should be able to retrieve only those reads spanning these two SNPs after I perform the VariantFiltration step. Therefore I am interested in retrieving only those reads and compare how many of them support the SNP or the reference wild type. For simplicity, let's assume the SNP positions are in base pair #1 and base pair #200
R = nucleotide same as reference (wild type)
* = SNP (mutant)
In the example above (assuming I only have 10 reads), I would be getting the following genotype proportions:
Position (bp)1 2 3 ... 198 199 200 201 Read number 1 * R R ... R R * R 2 * R R ... R R * R 3 R R R ... R R * R 4 R R R ... R R R R 5 R R R ... R R R R 6 R R R ... * R R R 7 * R R ... R R R R 8 * R R ... R R R R 9 * R R ... R R R R 10 R R * ... R R R R Genotype #1 (SNP at position 1 AND SNP at position 200) = 20%
Genotype #2 (Reference at position 1 AND SNP at position 200) = 10%
Genotype #3 (Reference at position 1 AND reference at position 200) = 40%
Genotype #4 (SNP at position 1 AND reference at position 200) = 30%
I hope this makes sense! We are interested in the proportions of these combinations spanning two nucleotide positions to see how many intra-host parasites are in there.
Thanks again,
-
Thank you Daniel Castañeda Mogollón, I'm not sure we have a specific GATK tool that can do this. I'll reach out to my team to get their insight, it may take me some time to get an answer on this.
-
I have an update regarding your question. I don't think you need a specific tool for measuring this, because you can probably use the Allele Depth for what you are trying to measure. Here's the tool doc page for this annotation:
https://gatk.broadinstitute.org/hc/en-us/articles/5358854704411-DepthPerAlleleBySample
Best,
Genevieve
Please sign in to leave a comment.
5 comments