Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Use LearnReadOrientationModel to build artifact-prior.tar.gz

Answered

5 comments

  • Genevieve Brandt (she/her)

    Hi chenglei,

    Thanks for writing in with your question! We can definitely help you figure this out.

    Have you seen the tutorial "(How to) Call somatic mutations using GATK4 Mutect2"? Its section titled "A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow" describes the steps you mention in your post.

    Please let me know if you have follow-up questions after checking out the tutorial.

    Best,

    Genevieve

  • chenglei

    Thank you for your response, but I am still confused. “A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow” says: “When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.”

    Does this mean that this single --f1r2-tar-gz output is passed to “gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz” to generate a single read-orientation-model.tar.gz, and that this single read-orientation-model.tar.gz is then used in FilterMutectCalls, like this?

    gatk FilterMutectCalls \
      -R ~/mydata/genome/hg38/hg38-genomefa/Homo_sapiens_assembly38.fasta \
      -V ~/mydata/03.gatk/b.SNP_Indel/tumorx.unfiltered.somatic.vcf.gz \
      --stats ~/mydata/03.gatk/b.SNP_Indel/tumorx.unfiltered.somatic.vcf.gz.stats \
      --filtering-stats tumorx.f1.vcf.gz.stats \
      --contamination-table ~/mydata/03.gatk/contamination/tumorx.contamination.table \
      --tumor-segmentation tumorx.segments.tsv \
      --ob-priors read-orientation-model.tar.gz \
      -O tumorx.f1.vcf.gz

     

    But I have 20 matched tumor-normal pairs. Each pair generates its own f1r2.tar.gz, and each f1r2.tar.gz generates its own read-orientation-model.tar.gz. This is what I do:

    Sample 1

    gatk Mutect2 \
      -R ~/mydata/01.index/Homo_sapiens_assembly38.fasta \
      -I ~/mydata/03.gatk/a.BQSR/normal1.MarkDuplicates.BQSR.bam \
      -I ~/mydata/03.gatk/a.BQSR/tumor1.MarkDuplicates.BQSR.bam \
      -normal normal1 \
      --germline-resource ~/mydata/genome/hg38/af-only-gnomad.hg38.vcf.gz \
      --panel-of-normals ~/mydata/03.gatk/b.SNP_Indel/panel_of_normal/pon.vcf.gz \
      --f1r2-tar-gz tumor1.f1r2.tar.gz \
      -L ~/mydata/genome/hg38/hg38-genomefa/intervallist/S07604514_Regions.bed \
      -O tumor1.unfiltered.somatic.vcf.gz

     

     

    Sample 2

    gatk Mutect2 \
      -R ~/mydata/01.index/Homo_sapiens_assembly38.fasta \
      -I ~/mydata/03.gatk/a.BQSR/normal2.MarkDuplicates.BQSR.bam \
      -I ~/mydata/03.gatk/a.BQSR/tumor2.MarkDuplicates.BQSR.bam \
      -normal normal2 \
      --germline-resource ~/mydata/genome/hg38/af-only-gnomad.hg38.vcf.gz \
      --panel-of-normals ~/mydata/03.gatk/b.SNP_Indel/panel_of_normal/pon.vcf.gz \
      --f1r2-tar-gz tumor2.f1r2.tar.gz \
      -L ~/mydata/genome/hg38/hg38-genomefa/intervallist/S07604514_Regions.bed \
      -O tumor2.unfiltered.somatic.vcf.gz

     

    Sample x

    gatk Mutect2 \
      -R ~/mydata/01.index/Homo_sapiens_assembly38.fasta \
      -I ~/mydata/03.gatk/a.BQSR/normalx.MarkDuplicates.BQSR.bam \
      -I ~/mydata/03.gatk/a.BQSR/tumorx.MarkDuplicates.BQSR.bam \
      -normal normalx \
      --germline-resource ~/mydata/genome/hg38/af-only-gnomad.hg38.vcf.gz \
      --panel-of-normals ~/mydata/03.gatk/b.SNP_Indel/panel_of_normal/pon.vcf.gz \
      --f1r2-tar-gz tumorx.f1r2.tar.gz \
      -L ~/mydata/genome/hg38/hg38-genomefa/intervallist/S07604514_Regions.bed \
      -O tumorx.unfiltered.somatic.vcf.gz

     

    The steps above generate x .f1r2.tar.gz files, one per tumor sample. I then run:

     

    gatk LearnReadOrientationModel -I tumor1.f1r2.tar.gz -O tumor1.read-orientation-model.tar.gz

    gatk LearnReadOrientationModel -I tumor2.f1r2.tar.gz -O tumor2.read-orientation-model.tar.gz

    gatk LearnReadOrientationModel -I tumorx.f1r2.tar.gz -O tumorx.read-orientation-model.tar.gz

     

     

    gatk FilterMutectCalls \
      -R ~/mydata/genome/hg38/hg38-genomefa/Homo_sapiens_assembly38.fasta \
      -V ~/mydata/03.gatk/b.SNP_Indel/tumor1.unfiltered.somatic.vcf.gz \
      --stats ~/mydata/03.gatk/b.SNP_Indel/tumor1.unfiltered.somatic.vcf.gz.stats \
      --filtering-stats tumor1.f1.vcf.gz.stats \
      --contamination-table ~/mydata/03.gatk/contamination/tumor1.contamination.table \
      --tumor-segmentation tumor1.segments.tsv \
      --ob-priors tumor1.read-orientation-model.tar.gz \
      -O tumor1.f1.vcf.gz

     

    gatk FilterMutectCalls \
      -R ~/mydata/genome/hg38/hg38-genomefa/Homo_sapiens_assembly38.fasta \
      -V ~/mydata/03.gatk/b.SNP_Indel/tumor2.unfiltered.somatic.vcf.gz \
      --stats ~/mydata/03.gatk/b.SNP_Indel/tumor2.unfiltered.somatic.vcf.gz.stats \
      --filtering-stats tumor2.f1.vcf.gz.stats \
      --contamination-table ~/mydata/03.gatk/contamination/tumor2.contamination.table \
      --tumor-segmentation tumor2.segments.tsv \
      --ob-priors tumor2.read-orientation-model.tar.gz \
      -O tumor2.f1.vcf.gz

     

     

    gatk FilterMutectCalls \
      -R ~/mydata/genome/hg38/hg38-genomefa/Homo_sapiens_assembly38.fasta \
      -V ~/mydata/03.gatk/b.SNP_Indel/tumorx.unfiltered.somatic.vcf.gz \
      --stats ~/mydata/03.gatk/b.SNP_Indel/tumorx.unfiltered.somatic.vcf.gz.stats \
      --filtering-stats tumorx.f1.vcf.gz.stats \
      --contamination-table ~/mydata/03.gatk/contamination/tumorx.contamination.table \
      --tumor-segmentation tumorx.segments.tsv \
      --ob-priors tumorx.read-orientation-model.tar.gz \
      -O tumorx.f1.vcf.gz

     

    Is what I did correct? I also don’t understand what “When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.” means. I need your help and look forward to your response. Thank you.

     

     

  • Genevieve Brandt (she/her)

    Are your tumor samples all from the same individual or from different individuals? 

  • chenglei

    My tumor samples are from different individuals.

  • Genevieve Brandt (she/her)

    OK! The sentence you referred to, "When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample", applies only if you run Mutect2 in multisample mode, that is, when you have multiple samples from the same individual.
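
For contrast, multisample mode for a single individual might look like the sketch below. It only prints the command (a dry run), so it can be tried without GATK installed. The BAM names, the sample name normalA, the output names, and the helper multisample_cmd are hypothetical, and arguments such as --germline-resource and --panel-of-normals are left out for brevity.

```shell
#!/bin/sh
# Dry-run sketch: prints (does not execute) one Mutect2 command for several
# samples from the SAME individual. File and sample names are hypothetical.
set -eu

# usage: multisample_cmd NORMAL_SAMPLE_NAME BAM [BAM ...]
multisample_cmd() {
  normal_name=$1
  shift
  inputs=""
  for bam in "$@"; do
    inputs="$inputs -I $bam"    # one -I per BAM: every tumor plus the matched normal
  done
  # A single --f1r2-tar-gz output holds the data for every tumor sample in the run.
  echo "gatk Mutect2 -R reference.fasta$inputs -normal $normal_name --f1r2-tar-gz patient.f1r2.tar.gz -O patient.unfiltered.vcf.gz"
}

multisample_cmd normalA tumorA_primary.bam tumorA_relapse.bam normalA.bam
```

Because all BAMs go into one Mutect2 run, there is one f1r2 archive and therefore one LearnReadOrientationModel call for that individual.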

    Since you have different individuals, you will run Mutect2 separately for each sample and LearnReadOrientationModel separately for each sample. 
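
A minimal sketch of that per-sample pattern follows. It prints the three commands for each tumor-normal pair instead of running them. The sample IDs, the shortened file names, and the helper per_sample_cmds are hypothetical, and the remaining Mutect2 and FilterMutectCalls arguments from the thread (germline resource, panel of normals, intervals, contamination table, and so on) are omitted for brevity.

```shell
#!/bin/sh
# Dry-run sketch: prints the per-sample commands instead of running them.
# Sample IDs and file names are hypothetical placeholders.
set -eu

# usage: per_sample_cmds ID
# Assumes BAMs named tumorID.bam / normalID.bam and a normal sample name normalID.
per_sample_cmds() {
  id=$1
  # 1. Call variants for this pair and collect its own F1R2 counts.
  echo "gatk Mutect2 -R reference.fasta -I normal${id}.bam -I tumor${id}.bam -normal normal${id} --f1r2-tar-gz tumor${id}.f1r2.tar.gz -O tumor${id}.unfiltered.vcf.gz"
  # 2. Learn the orientation model from this sample's counts only.
  echo "gatk LearnReadOrientationModel -I tumor${id}.f1r2.tar.gz -O tumor${id}.read-orientation-model.tar.gz"
  # 3. Filter this sample's calls with this sample's model.
  echo "gatk FilterMutectCalls -R reference.fasta -V tumor${id}.unfiltered.vcf.gz --ob-priors tumor${id}.read-orientation-model.tar.gz -O tumor${id}.filtered.vcf.gz"
}

for id in 1 2 3; do
  per_sample_cmds "$id"
done
```

Deriving every file name from the same ID keeps each sample's f1r2 archive, model, and filtered VCF paired, which avoids the copy-paste mix-ups that are easy to make when the commands are written out by hand.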

    Please let me know if I can clarify this further for you or if you have any remaining questions.

