Hello! I am a new user to your pipeline! And so far I love how intuitive it is and easy to implement. However I am having a hard time estimating certain proportions. Here's what I am doing:
I have 13 DNA sequenced samples from Plasmodium falciparum (haplotype, ploidy = 1), where I sequence a gene that is known to confer resistance to a treatment. In each sample, I am expecting multiple genotypes from this strain, but I am interested in estimating the proportion of each genotype. Here's what I've done so far:
1. BWA to map raw reads to my reference gene, followed by index of my reference files (not included in this message):
bwa mem 3d7_dhfr.fasta S13_R1.fastq S13_R2.fastq > S13_dhfr.sam
2. Mark duplicates with Picard:
picard MarkDuplicates -I S13_dhfr.bam -O S13_dhfr_markeddup.bam -M S13_dup_dhfr_metrics.txt
3. Adding read group headers to my sam file followed by picard ValidateSamFile (the latter not included here):
picard AddOrReplaceReadGroups -I S13_dhfr_markeddup.bam -O S13_dhfr_RG.bam -SM 13 -LB lib1 -PL ILLUMINA -PU unit13
5. HaplotypeCaller with these parameters:
gatk HaplotypeCaller -I S13_dhfr.bam -O S13_dhfr.vcf -R 3d7_dhfr.fasta -ploidy 1 -bamout --standard-min-confidence-threshld-for-calling 30 -ERC GVCF
6. Merge my .vcf files into a single vcf file using gatk (to join 13 samples where I ran the previous commands in each one of them):
gatk CombineGVCFs -R 3d7_dhfr.fasta -V S1_dhfr.vcf -V S2_dhfr.vcf ... -V S13_dhfr.vcf -O dhfr_merged.vcf
7. Estimating genotypes from the merged vcfs
gatk GenotypeGVCFs -V dhfr_merged.vcf -O genotypes_dhfr.vcf -R 3d7_dhfr.fasta
8. Hard filtering on variants by (i) a minimum of 50 reads per nucleotide position, (ii) 1% of depth to support that particular SNP or indel, and (iii) a minimum phred score quality of 30
gatk VariantFiltration -V genotypes_dhfr.vcf -O dhfr_filtered.vcf -filter "QUAL < 30" --filter-name "QUAL30" -filter "AF < 0.01" --filter-name AF0.01 -filter "DP < 50" --filter-name "DP50"
After this is where I'm lost even after watching the lecture videos you guys kindly provided in April of 2020. I tried doing genotype posterior, but then I realized that's only for diploid organisms to calculate homozygous or heterozygous alleles (which is not my case). My questions are:
(i) Which gatk command should I use to estimate the genotype proportions of SNPs found within an Illumina read size (SNPs less than 200 bp from each other), i.e.:
SNP1 - SNP2 proportion
WT1 - SNP2 proportion
SNP1 - WT2 propotion
WT1 - WT2 proportion
(ii) How does gatk handle genotype calling when the SNPs are not supported by the same short read? For instance, I noticed I have an indel 700 bp away from SNP2. Does gatk call this genotype separately independent from the previous 2 SNPs I used as an example? Or does it perform haplotype reconstruction on short reads?
I am sorry if this is a discussion that has been published in the past, but as fas as I can tell I couldn't find anything similar.
Thanks for your valuable time and help!
Please sign in to leave a comment.