Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Base Quality Score Recalibration (BQSR)

13 comments

  • Nickier

    How do I set the COMPRESSION_LEVEL for ApplyBQSR? I found that the output BAM file is twice the size of the original BAM file, even though the original BAM was written with COMPRESSION_LEVEL=2.

     
    0
  • cali

    gatk ApplyBQSR \
    --java-options "-Xmx6G -Dsamjdk.compression_level=5" \
    -R $ref \
    -I $bam_in \
    --bqsr-recal-file $table \
    -L $contig \
    -O $bam_out

    0
  • Yiguan Wang

    I'm currently working on Drosophila genomes, for which there isn't a known list of variants. How do I perform bootstrapping to generate a set of known variants? Is there a pipeline for that? Thanks in advance!

    4
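A minimal sketch of one bootstrapping round for an organism without known sites: call variants, keep only the most confident calls, and feed them to BaseRecalibrator as known sites. The helper name `bootstrap_round`, the output file names, and the filter thresholds are illustrative assumptions, not GATK recommendations for Drosophila. Recalibrating the original BAM in every round, rather than the previous round's output, is also a design assumption of this sketch.

```shell
#!/usr/bin/env bash
# GATK launcher; override it (e.g. GATK=echo) for a dry run that only prints the commands.
GATK="${GATK:-gatk}"

# One bootstrapping round (hypothetical helper, not part of GATK).
# Usage: bootstrap_round <reference.fasta> <bam-to-call-on> <original.bam> <round-number>
bootstrap_round() {
  local ref="$1" call_bam="$2" orig_bam="$3" round="$4"
  local prefix="round${round}"

  # 1. Call variants on the current best BAM (the original BAM in round 1).
  "$GATK" HaplotypeCaller -R "$ref" -I "$call_bam" \
    -O "${prefix}.raw.vcf.gz" || return 1

  # 2. Flag low-confidence calls. The thresholds are placeholders; tune them for your data.
  "$GATK" VariantFiltration -R "$ref" -V "${prefix}.raw.vcf.gz" \
    --filter-name "low_confidence" \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    -O "${prefix}.flagged.vcf.gz" || return 1

  # 3. Keep only the calls that passed the filter; these act as the known sites.
  "$GATK" SelectVariants -R "$ref" -V "${prefix}.flagged.vcf.gz" \
    --exclude-filtered \
    -O "${prefix}.known.vcf.gz" || return 1

  # 4. Build the recalibration table from the original BAM.
  "$GATK" BaseRecalibrator -R "$ref" -I "$orig_bam" \
    --known-sites "${prefix}.known.vcf.gz" \
    -O "${prefix}.recal.table" || return 1

  # 5. Apply the recalibration to the original BAM.
  "$GATK" ApplyBQSR -R "$ref" -I "$orig_bam" \
    --bqsr-recal-file "${prefix}.recal.table" \
    -O "${prefix}.recal.bam"
}
```

For example, `bootstrap_round ref.fasta sample.bam sample.bam 1` could be followed by `bootstrap_round ref.fasta round1.recal.bam sample.bam 2`. Comparing the recalibration tables of successive rounds, for instance with AnalyzeCovariates, is one way to judge whether the process has converged.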
  • Adrián Segura

    Hi, I have a question. According to the description of this process, the BaseRecalibrator tool requires databases of known polymorphisms to recalibrate the base qualities. As explained in this document, any difference with respect to these references (dbSNP, gnomAD, ...) is considered an error. Is this not counterproductive for the detection of somatic variants in tumor samples? Shouldn't I then also provide BaseRecalibrator with data from COSMIC or some other specialized database of somatic mutations?

    2
  • Joanna

    Hi all, 

    I have a question: is BQSR needed in the case of Canis lupus familiaris (dog)?

    Thanks in advance!

    Joanna

    0
  • Sophie Agger

    Joanna, yes. Whether BQSR is needed is not related to species.

    0
  • Sophie Agger

    I've had a technical issue with this tool. If your disk is full, it doesn't throw an error but just keeps chugging along. In most cases you'd notice this due to the lack of EOL, but in theory this could lead to a truncated BAM file where you can't see that it's truncated, plus it's a lot of work to fix manually. Is this a known bug, or is it just something I'll have to live with?

    0
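One cheap check for the truncation case described above is to confirm that the output ends with the 28-byte BGZF end-of-file marker defined in the SAM/BAM specification. A minimal sketch, assuming a shell with `tail`, `od`, and `tr` available; the function name `has_bgzf_eof` is made up for illustration. It only detects a missing marker, so a passing result does not prove that every record was written.

```shell
#!/usr/bin/env bash
# Hex form of the empty BGZF block that terminates a complete BAM file (28 bytes).
BGZF_EOF_HEX="1f8b08040000000000ff0600424302001b0003000000000000000000"

# Succeeds (exit status 0) if the file ends with the BGZF end-of-file marker.
# Usage: has_bgzf_eof <file.bam>
has_bgzf_eof() {
  local tail_hex
  # Dump the last 28 bytes as hex and strip the spaces and newlines od inserts.
  tail_hex="$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')"
  [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}
```

A call such as `has_bgzf_eof recal.bam || echo "recal.bam looks truncated"` could run right after ApplyBQSR. `samtools quickcheck` performs a similar end-of-file test if samtools is installed.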
  • Conrad Leonard

    Adrián Segura, I think the assumption is that for most tumours the number of positions affected by somatic variation is negligible compared to the total size of the genome, so they won't affect the bulk statistics much. But I do wonder about tumours with high TMB, and especially those with a distinctive mutational signature, e.g. UV for melanoma, where somatic variation is highly correlated with base context. One could imagine that, in that case, for some bins a non-negligible proportion of the 'error' in the bin is real variation, which would lead to improper downward base quality recalibration at the exact sites where you want to call. Geraldine Van der Auwera, is there guidance on this from the GATK team? Maybe we could do some experiments...

    2
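The concern above can be made concrete with a little arithmetic. The sketch below uses the plain Phred definition, Q = -10 log10(mismatches / observations), as a simplification; it is not GATK's actual estimator, and the function name `phred_from_counts` and all counts are invented for illustration.

```shell
#!/usr/bin/env bash
# Phred-scaled quality implied by a mismatch count in a covariate bin.
# Simplified formula for illustration only (not GATK's empirical-quality estimator).
# Usage: phred_from_counts <mismatches> <observations>
phred_from_counts() {
  LC_ALL=C awk -v e="$1" -v n="$2" \
    'BEGIN { printf "%.1f\n", -10 * log(e / n) / log(10) }'
}

# A hypothetical bin with 1,000,000 observations:
phred_from_counts 1000 1000000   # 1000 sequencing errors only
phred_from_counts 2000 1000000   # same bin plus 1000 real variants counted as errors
```

In this made-up bin, the first call prints 30.0 and the second prints 27.0, so real variation equal in number to the sequencing errors would pull the apparent quality down by about three Phred units.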
  • Sheryl

    Conrad Leonard, did you get any response to this? I think it's an interesting point, and one on which I would like to get advice from the GATK team.

    0
  • Conrad Leonard

    Sheryl, sorry, no. I didn't get a response here or through other channels. Still interested in the answer though...

    0
  • Sheryl

    Another consideration, of course, is that these databases are biased against any populations that have little database representation, e.g. indigenous populations.

     

    1
  • Caitlin Redak

    I am working through the BQSR pipeline, and my recalibration table says there are 0 errors, regardless of the number of observations or the quality score. This seems impossible. Is there anything that would cause this problem? Or is it an issue with my data?

    0
  • Quanwei Zhang

    I wonder whether further processing is needed after downloading known sites from dbSNP or other databases, for example for multi-allelic variants or indels. I am not sure whether they need to be normalized into a certain format before being input into GATK.

    #examples from dbsnp
    1       10001   rs1570391677    T       A,C     .       .       RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
    1       10002   rs1570391692    A       C       .       .       RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9944,0.005597

    0
