Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

BaseRecalibrator SNP databases


  • danilovkiri

    Hi.

    The cornerstone of genomic data pipelines is that all data, both the external resources and the files you produce, MUST share the same reference dictionary, i.e. the contig names must be absolutely identical. Any genome build can be distributed under several contig naming schemes: chr1, 1, or accessions like NC_000001.11, etc. There are three options for you:

    1. Look into the reference genome FASTA file you used for alignment and find out which chromosome naming scheme it uses. Personally, I prefer the chrN scheme as it speaks for itself. In your case, it seems that the resource VCF files for BQSR use the chrN naming scheme while your reference uses accessions, so you have to rename the chrN contigs in the VCFs to the matching NC_-style accessions (such as NC_000001.11). This can be done with the `bcftools annotate --rename-chrs` option (http://samtools.github.io/bcftools/bcftools.html#annotate). You have to provide a TSV file with two columns, where the first column is the contig name currently used in the file being renamed (here, chrN) and the second column is the target name (here, the accession).

    2. Rename the accessions in the reference FASTA file, i.e. in the header lines that start with `>`. This can be done manually with `sed` in bash.

    3. Since you have already spent a lot of CPU time on alignment, I presume the previous option is not preferable, as it would mean realigning against the renamed reference. Instead, I suggest renaming the accessions in the BAM files (https://www.biostars.org/p/13462/). Note that the accessions must be renamed in both the BAM header and the body.
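
The first two options can be sketched in Python. This is only a minimal illustration under stated assumptions: the mapping below is a partial excerpt for GRCh38 that must be completed from the NCBI assembly page, and the helper names (`rename_map_lines`, `rename_fasta_headers`) and all file names are hypothetical, not part of bcftools or GATK.

```python
# Partial, illustrative chrN -> accession mapping for GRCh38.
# Assumption: complete it from the NCBI assembly page before real use.
CHR_TO_ACCESSION = {
    "chr1": "NC_000001.11",
    "chr2": "NC_000002.12",
}


def rename_map_lines(mapping):
    """Return 'old_name<TAB>new_name' lines, one pair per line, which is the
    layout read by `bcftools annotate --rename-chrs`."""
    return [f"{old}\t{new}" for old, new in mapping.items()]


def rename_fasta_headers(lines, mapping):
    """Yield FASTA lines with the sequence name of each '>' header replaced
    when it is a key of `mapping`; sequence lines pass through unchanged."""
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the text between '>' and the first space.
            name, sep, rest = line[1:].rstrip("\n").partition(" ")
            yield ">" + mapping.get(name, name) + sep + rest + "\n"
        else:
            yield line


# Option 1: write the map (hypothetical file name), then run for example:
#   bcftools annotate --rename-chrs rename_map.tsv -O z -o renamed.vcf.gz input.vcf.gz
map_lines = rename_map_lines(CHR_TO_ACCESSION)

# Option 2: rename accessions in the FASTA headers to chrN (inverse mapping).
accession_to_chr = {acc: name for name, acc in CHR_TO_ACCESSION.items()}
fasta = [">NC_000001.11 Homo sapiens chromosome 1\n", "ACGT\n"]
renamed = list(rename_fasta_headers(fasta, accession_to_chr))
```

If the reference FASTA is renamed this way, its index and sequence dictionary need to be regenerated afterwards so that they carry the new names.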

    Final note: always check the chromosome naming schemes in the data you use. Most external resources happen to use the chrN-like naming scheme, so I suggest you use it as well in the future. In case you find it helpful, look at the "Global Assembly Definition" section at https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/#/def_asm_Primary_Assembly for GRCh38. It has a mapping of NC-like accessions to conventional chromosomes. I guess you'll see the logic behind it. By the way, the NC-like accessions for GRCh37 and GRCh38 differ only in the version suffix after the dot (e.g. NC_000001.10 vs NC_000001.11).
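
Checking the naming schemes can be automated before running BaseRecalibrator. The sketch below is one possible way to do it, assuming the VCF header carries `##contig` lines (not every VCF does) and the reference has a sequence dictionary with `@SQ` lines; the function names are hypothetical and the header lines are made-up examples.

```python
import re


def vcf_contigs(header_lines):
    """Collect contig IDs from '##contig=<ID=...>' VCF header lines."""
    names = []
    for line in header_lines:
        match = re.match(r"##contig=<ID=([^,>]+)", line)
        if match:
            names.append(match.group(1))
    return names


def dict_contigs(dict_lines):
    """Collect SN: values from '@SQ' lines of a sequence dictionary
    or SAM/BAM text header."""
    names = []
    for line in dict_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@SQ":
            for field in fields[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def missing_from_reference(vcf_names, ref_names):
    """Return the VCF contigs that do not appear in the reference, in order."""
    ref = set(ref_names)
    return [name for name in vcf_names if name not in ref]


# Made-up example: a chrN-style resource VCF against an accession-style reference.
vcf_header = ["##fileformat=VCFv4.2\n", "##contig=<ID=chr1,length=248956422>\n"]
ref_dict = ["@HD\tVN:1.6\n", "@SQ\tSN:NC_000001.11\tLN:248956422\n"]
mismatched = missing_from_reference(vcf_contigs(vcf_header), dict_contigs(ref_dict))
```

A non-empty `mismatched` list means the resource VCF and the reference use different contig names, which is the situation described in this thread.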

  • Bhanu Gandham

    Hi danilovkiri

     

    Thank you so much for your response and for contributing to the knowledge base of this forum. That was a very detailed and good explanation. I will refer other users to this post when such a question comes up again!

    GATK team thanks you for your contribution!
