Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Mutect 2 on mitochondria DNA variant discovery

0

26 comments

  • Avatar
    Gökalp Çelik

    Hi Ernestine Kubi Amos-Abanyie

    Can you tell us more about how you mapped your nanopore reads? We don't expect to have any pairings between nanopore reads therefore something is probably not right about your bam input. What mapper did you use?

    Can you run ValidateSamFile on that bam file?

    0
    Comment actions Permalink
  • Hi Gökalp Çelik,

     

    Sorry my response is coming late,

    both the nanopore reads and the short reads (linked reads) were mapped using this command line.

    Thank you

    for fastq_file in $input_dir/*.fastq.gz; do
            sample_name=$(basename "$fastq_file" .fastq.gz)
            minimap2 -ax map-ont --secondary=no -t 40 -N 5 -R "@RG\tID:$sample_name\tSM:$sample_name" $reference_genome $fastq_file | 
                    samtools sort -m8G -@48 -o  $output_dir/$sample_name.bam -
                            samtools index  $output_dir/$sample_name.bam
    done
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    What is the version of minimap2 that you are using during this mapping?

    0
    Comment actions Permalink
  • version 2.24-r1122

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    I tried to replicate a similar scenario with the data that I have but I have not been able to replicate this error. There are a bunch of questions that I have to ask before proceeding.

    1- What is the unaligned.bam file you have in your command line? -ALIGNED parameter expects a properly aligned sam or bam file. 

    2- Are you combining your mapped bam  files after minimap2 step?

    Looking for a paired long read is not an expected behavior but it looks like some of your reads might have been tagged improperly inside the bam file. 

    Can you return the result of the command 

    samtools view -f 0x1 

    for your aligned ont reads ?

    0
    Comment actions Permalink
  • yes when i run this the file generated is 0kb. is it possible to have a meeting with me via zoom ?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    We are unable to provide any support outside of this forum. We can continue debugging your case from here no problem. 

    So if your aligned reads don't have any pair tags then there must be something within unmapped reads that you are merging. How did you generate those unmapped.bam files?

    0
    Comment actions Permalink
  • Sure no problem hopefully i get passed this. the unmapped bam file was generated using this command from your pipeline.  Also what about the errors from the short reads, or we will look at that after debugging the nanopore reads. 

    To answer this : Are you combining your mapped bam  files after minimap2 step? I did not perform minimap2 step.

    question: If I am using mutect2 pipeline on mitochondria and this is not somatic cell line do I still need to go through all the steps or can directly do variant calling on just the Bam file.

    Also can I send you the Bam file for you to look at incase there is something I am missing.

     

    Thank you

    gatk RevertSamSpark  \\
          -I input.bam \\
          -O reverted.bam
     
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Ernestine Kubi Amos-Abanyie

    RevertSamSpark is still in beta stage and we do not recommend it's usage within production workflows. 

    Since you are working with long unpaired read we doubt that you will observe any added benefits from using MergeBamAlignment step other than adding additional tags to aligned reads from unmapped reads for special purposes therefore you may be able to proceed to further stages without MergeBamAlignment step for your long read samples. Also the warning message you receive for your short reads is expected and does not cause any issues in the downstream analysis. 

    I hope this helps.

    Regards. 

    0
    Comment actions Permalink
  • ok so you suggest I use a different tool to revert the aligned Bam file to generate the unmapped BAM files that is for the short reads? Also can you point me to a document on what the Shifted Bam file is? 

    Also the warning message you receive for your short reads is expected and does not cause any issues in the downstream analysis. : I do not get an output for this even though its supposed to be a normal warning .

     

    And for the nanopore reads I can jump the MergeBam alignment step and follow the steps to variant calling?

    Can I do same for the short reads as well?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Ernestine Kubi Amos-Abanyie

    We recommend FastqToSam or RevertSam (non-spark) tools to generate unmapped bam files. 

    Also looking at the command you use to map your reads, we recommend using bwa mem for short reads whereas minimap2 is only recommended for long reads from nanopore or pacbio. 

    You mentioned about linked reads what do you mean by that? Do you mean paired-end sequenced short reads?. 

    0
    Comment actions Permalink
  • Gökalp Çelik 

    The linked reads is a liitle bit different technology compared to the paired end short reads. In brief, the a long DNA fragment is first split into ~40-50Kb parts, and all short reads that come from these parts get the same barcode. This means that you can reassign short reads back to the longer read. And for these were aligned using LongRanger.

    Also, this step MergeBam alignment and further steps requires the shifted reference aligned Bam can you point me to what this file is about and what tool is used to generate that if you do have any documentation on that.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Ernestine Kubi Amos-Abanyie

    I am not aware of any application of MergeBamAlignment with linked reads but I will sure ask the team to answer about it. 

    Shifted reference is used to map reads to the circular genome of mitochondria to eliminate issues of end of reference mapping. It is only used at 2 steps where you map all the reads to shifted reference and recall variants using the shifted reference all of which then be liftedover to the unshifted reference for proper merging of variants and refinement of calls. The following tool is used to generate the shifted fasta 

    gatk ShiftFasta 

    You may check the tool and try yourself to get a grip of what it does. Basically it rotates the circular reference file towards another edge so that bases located at the ends of the original string moves towards to middle to get a better variant calling performance. Tool also generates a liftover file to shift those variant sites back to the original reference and locate them at proper positions in the complete VCF file. 

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again. 

    We have an input from our team about MergeBamAlignment step. That step is especially important to have all read metadata contained within bam file however in your case you have linked long reads and conventional long reads from different technologies therefore you may skip MergeBamAlignment steps if you think that you do not need any additional metadata added to your reads (Which I believe you don't otherwise longranger may not work properly).

    I hope this helps.  

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    Thank you. this has been helpful and will definately come back should i hit a road block 

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    Hi Gökalp Çelik,

    Its me again. Thank you for responding to my concerns and I think I have made a good headway in understanding how this tool works and prerequistes I will summarize my understanding kindly confirm and I also have a few questions.

    1. extract Mitochondria read from WGS aligned BAM

    2. Unmap the mitochondria Read, convert to fastq file and realign with both normal chrM reference fasta and shifted reference

    3. Merge the realigned to the unaligned BAM file (for both refernce and shifted reference)

    4. Mark duplicates for both file

    5 Variant call using chrM reference for reference alinged_merged BAM and shifted reference for shifted refeence aligned_merged BAM.

    6 liftover shifted vcf

    7. merge reference vcf with liftover_shift vcf and the stat files

    Now my question is for long reads which of these preprocessing steps do I skip before varaint calling. I am asking this because I think I run this process for variant calling for the both short read and long read for same individual. the steps were succesful thinking I made no errors now  I am repeating for same sample again only to be getting just 2 varaint as oppose to about 200 from the initial tial (note in this trial I did not skip any step but now going through same and I feel like something is not right) 

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    I am not sure if I understood it correctly. Do you have both short and long reads for the same sample? If so you may try running the same procedure for both and combine your findings possibly to get a better map and get rid of sequencing artifacts due to short and long reads individually. 

    Shifting the reference is a necessary step since the mitochondrial genome is circular and mappings at the ends of the reference tend to get messier once reads get clipped and split. We want to avoid that and get a better picture of the whole circular genome. 

    For the long reads you may not need the merging step of unmapped bam file with the aligned bam file since long reads are more complete and less in need for replacing hardclips with softclips. 

    I hope this helps.

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    First of all I wanted you to confirm if the steps as laid out assuming I have both the reference and shifted referece files is correct.

    And yes I have both short and naanopore long reads for same individual. However, after I did the first test run for this sample on both files to generate my vcf files (there is a reason we are doing both short reads and nonopore reads). I however I mistakenly deleted the files and repeating the steps again for both short and long reads. I have been able to replicate the same short read only the same long read is giving me a totally different output and I am wondering what I am doing wrong. or this entire steps does not favour long reads and peharps the first read I probably used the wrong file to generate the initial vcf file for the long read.(i hope my long essay makes sense)

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Our mitochondria pipeline is geared torwards short reads with high depth. For long reads I am not sure what goes wrong with your data. Can you elaborate more about what is different between 2 runs of long reads ?

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    with the first run the vcf merged vcf file (from control and non control region) had like 200 variants when I did the variant count however with the rerun the noncontrol region generate a file with only 2 variants and the control region non previously i got 12 variants from the control region 

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    Also confirm if my step by step of the pipleline for me

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Can you check if you are using the same fastq input for both runs? Looks like something must have been changed dramatically in between 2 runs. This work flow is quite deterministic in terms of reproducibility therefore there must be something discordant between 2 runs. 

     

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    Probably not. Probably for the first run I must have used the fastq generated for short read to generate the sam file using minimap2. Il repeat this assumption and feed back. However are my outlined steps correct?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Minimap2 vs BWA MEM could cause a difference between runs however I am not sure if this explains the lack of all those variants. 

    Your steps are correct. 

     

    0
    Comment actions Permalink
  • Avatar
    Ernestine Kubi Amos-Abanyie

    Also has this worked on Mitochondria for Whole Exome sequencing.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    We would recommend against it since Exome sequencing can leak Mitochondria DNA but only a small fraction of what gets along with the capture probes. Results may not be desirable. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk