Fungal mitochondrial variant calling
Hi Megan,
Thanks for the quick response.Megan Shand
I am working on fungal mitochondrial variants. I am wondering if the new mitochondrial pipeline is only for humans and I should use other tools like HaplotypeCaller or Mutect2 mitochondria mode.
The sequence file was downloaded from the NCBI database. After extracting the fastq files, I did not check the quality or trim the files. Instead, I tried the methods to clean the data referring to this article https://gatk.broadinstitute.org/hc/en-us/articles/360039568932--How-to-Map-and-clean-up-short-read-sequence-data-efficiently
The species I am working on is not a model one. So I guess I should do bootstrap to get trusted sites and do the BQSR. For this step, I have a total of 88 samples, should I use all of their BAM files to infer the trusted sites or just one is enough?
For the variant calling, I am quite struggling to decide which tool to use, any suggestion will be appreciated.
Is this workflow correct?
Thank you very much!
The GATK version I am using is 4.1
These are the commands used
Data: DRR022915_1.fastq;DRR022915_2.fastq
Reference mitochondrial genome: AF293.fasta
1.Clean up
- FastqToSam # Converts a FASTQ file to an unaligned BAM or SAM file
java -jar picard.jar FastqToSam F1=DRR022915_1.fastq F2=DRR022915_2.fastq O=unaligned_DRR022915_pairs.bam SM=IFM59355-1 RG=DRR022915
- Mark adapter sequences using MarkIlluminaAdapters
java -Xmx8G -jar picard.jar MarkIlluminaAdapters I=unaligned_DRR022915_pairs.bam O=unaligned_DRR022915_markilluminaadapters.bam M=DRR022915_markilluminaadapters_metrics.txt TMP_DIR= ./tmp
- Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
java -jar ../../../picard.jar FastqToSam I=unaligned_DRR022915_markilluminaadapters.bam FASTQ=DRR022915_samtofastq_interleaved.fq CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true TMP_DIR=./tmp
bwa index -a bwtsw AF293.fasta
bwa mem -M -t 7 -p AF293.fasta DRR022915_samtofastq_interleaved.fq > DRR022915_bwa_mem.sam
gatk-launch CreateSequenceDictionary -R AF293.fasta
java -Xmx16G -jar ../../../picard.jar MergeBamAlignment R=AF293.fasta UNMAPPED_BAM=unaligned_DRR022915_pairs.bam ALIGNED_BAM=DRR022915_bwa_mem.sam O=DRR022915_mergebamalignment.bam CREATE_INDEX=true ADD_MATE_CIGAR=true CLIP_ADAPTERS=false INCLUDE_SECONDARY_ALIGNMENTS=true PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS TMP_DIR=./tmp
2.Duplicates Marking
DRR022915_AF293_aln_head.sorted.bam =DRR022915_mergebamalignment.bam
java -jar picard.jar MarkDuplicates INPUT=DRR022915_AF293_aln_head.sorted.bam OUTPUT=DRR022915_AF293_aln_head.sorted.markdup.bam METRICS_FILE= DRR022915.metrics.txt
3.Base recalibration -- does it matter which program used to generate the initial vcf file?
4.Variant calling -- which program??
-
Another thing I notice is that mitochondria have very limited variants in their genomes. This species's mitogenome sizes are around 40kb.
I can only found several variants in DRR022915 compared to AF293 with HaplotypeCaller. In this case, is Base Recalibration recommended?
-
Hi,
I don't have any experience with fungal mitochondria samples, but I'll try to address what to look out for in choosing these tools.
For the bootstrapping of BQSR, the key is that you want to have an overwhelming amount of sites that are truly non-variant compared to the number of sites you have that might have variation. For the human genome this is achieved easily by removing common variant sites. Depending on the size of the fungal mitochondria and how polymorphic you expect it to be you might only need to bootstrap with one of your samples, or you could use more.
For the variant calling: do you know the ploidy of the fungal mitochondria or is the copy number so high (as it is in humans) that you expect to see various low allele fractions at variant sites? We use Mutect2 for the mitochondria calling because we expect the ploidy is so high that it doesn't make sense to use HaplotypeCaller and we expect low allele fraction variants (such as 5% of the reads having an alternate allele for example).
If you do choose to use the full mitochondria pipeline it includes a shifted (or rotated) reference. If the fungal mitochondria is circular and you are concerned about the quality of calls near an artificial breakpoint in the reference, then it might be worth generating shifted reference files. We are working on tools to allow you to do that easily, but they are unfortunately not available yet. If you don't expect the edge of the reference to be highly variable or low coverage, then you might be able to get away with just running the reference as is with the pipeline you've described.
Hopefully that helps a bit! I think you'll have to do a good deal of exploring your particular dataset to be sure that the tools are working for this use case. Best of luck!
-
Sorry, I didn't see your last comment before I replied! If these samples don't have many variants, then you can still run BQSR, but without removing many known variant sites (since there aren't that many). Since most of the data should match the reference there will be enough data to train the BQSR model without worrying about real variant sites. Or you could just remove the sites you've already found without worrying about using all of the samples.
-
Hi Megan,
Thanks for the detailed reply.
I will try to run BQSR.
For the fungal mitogenome, it usually has thousands of copies, some introns, and heteroplasmy could exist. I think Mutect2 mitochondria mode fits my requirement.
Another question that haunted me is which preprocess is preferred. With the high copy number of mitogenome, is it necessary to trim the reads or to clean the data as this article (https://gatk.broadinstitute.org/hc/en-us/articles/360039568932--How-to-Map-and-clean-up-short-read-sequence-data-efficiently) talks about? Can I just skip it?
Thanks~
-
I think trimming adapters depends more on the sequencing methods than the fact that it is a fungal mitogenome. If you choose to skip that step I'd just be sure to look at your bams in IGV to make sure there isn't excessive adapter sequence from short inserts.
Please sign in to leave a comment.
5 comments