Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Data pre-processing for variant discovery - BaseRecalibrator, vcf input?

Answered
0

8 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Arik, which files do you have and which are you missing? Please provide the specific command

    0
    Comment actions Permalink
  • Avatar
    Arye Harel

    Hi Genevieve, 

    Thank you for your fast response.
    I am missing a vcf file.
    These are the main stages and files I already ran:

    1. Task: QA
    Software: fastqc
    Input:   individualA_f.fq.gz , individualA_r.fq.gz derived from illumina Hiseq
    Result: 2 html files showing sequences with duplicates and adapters.


    2. Task: remove low QA sequences, and adapters.
    Software: Trimmomatic 
    input:  individualA_f.fq.gz , individualA_r.fq.gz
    output: individualA_f_paired.fq.gz , individualA_r_paried.fq.gz
                 individualA_f_unpaired.fq.gz , individualA_r_unparied.fq.gz

    ---- From here on protocol is based on GATK  ---

    3. Mapping (based on GATK)

    3A. Indexing:

    Command: bwa index -a bwtsw Genome.fasta
    input:  PlantGenome.fasta
    output: indexing files for genome 

    3B. Mapping:

    Command: bwa mem -M -t 14 PlantGenome.fasta individualA_f_paired.fq , individualA_r_paried.fq > individualA_paired.sam

    input:  see in command
    output: individualA_paired.sam

    4. Task: Mark Duplicates

    4a. sort
    Command: picard.jar SortSam I=individualA_paired.sam  O=individualA_paired_sorted.sam SORT_ORDER=coordinate

    input: 
    see in command
    output: individualA_paired_sorted.sam

    4b. markduplicates
    Command: picard.jar MarkDuplicates  I=individualA_paired_sorted.sam O=individualA_paired_markDup.sam ASSUME_SORT_ORDER=coordinate M=individualA_markDuplic_matrix

    input:  see in command
    output: individualA_paired_markDup.sam, individualA_markDuplic_matrix

    4c. sort again

    Same command as in 4a:
    input:  individualA_paired_markDup.sam
    output: individualA_paired_markdup_ReSorted.sam

     

    5. The next staage according to https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery is:

    Base (Quality Score) Recalibration done with BaseRecalibrator and Apply Recalibration.

    However BaseRecalibrator requires this argument:
    --known-sites sites_of_variation.vcf

    However, I still don't have a vcf file at this stage of the protocol.

    In other words: how do I make the  sites_of_variation.vcf required by the --known-sites argument? In which stage it should have been created?

    What am I am missing here?

     

    Thank you very much for your hep,

     

    Arik

    0
    Comment actions Permalink
  • Avatar
    Arye Harel

    One more question:

    Looking at "https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery", it reads:

    "Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)"

    Does it mean the whole "Base (Quality Score) Recalibration" stage is  an optional stage? Or  just the AnalyzeCovariates stage?

    If the whole stage is optional when does one removes duplicates? in the MarkDuplicates stage?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hello Arik, the Base Quality Score Recalibration (BQSR) step is recommended to improve results, but not required to run an analysis using GATK. You need at least 100M bases per read group to get high quality results. Here is an overview in our documentation: Base Quality Score Recalibration (BQSR)

    The file: --known-sites sites_of_variation.vcf is not created in your workflow, it is a VCF file containing the variants that have been confirmed for the species you are studying. Here is more information about where you can find those files: Where can I find known variants, training and truth sets, and other resource files

    0
    Comment actions Permalink
  • Avatar
    Arye Harel

    Dear Genevieve,

    Thank you very much for the helpful response.

    Arik

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Glad we were able to help!

    0
    Comment actions Permalink
  • Avatar
    Peiwen Li

    Hi,

    I have a similar situation as Arye; I am following the GATK best practices workflow for my non-human organism's data, and I am now in Step Base Quality Score Recalibration (BQSR), and yet the sites_of_variation.vcf file is not available. 

    I wonder if I should skip the BQSR step entirely? However, in the BaseRecalibrator description page, it says:

    "These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator".

    If I need to skip the BQSR step entirely, should I still apply these filters to remove duplicated and unmapped reads from my bam files before the variant calling step?

    Thank you!

    Peiwen 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Peiwen Li

    You don't need to apply those read filters without BQSR, many of those filters are applied with other steps in the best practices workflow.

    Best,

    Genevieve

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk