BaseRecalibrator SNP databases
Hello
I have been struggling with GATK 4 and feeling an acute lack of the command line pipelines available in previous versions. I took a look at the Tool Index and tried to make out the commands available from the https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-best-practices.wdl section
I have RNASeq data and I have used a GrCh38.p13 genomic reference not in the GATK resource bucket and currently I have processed around 600 samples with this same reference and followed till the Split n Cigar step with no problem
In the BaseRecalibrator step I get the following error :
a) GATK version
The Genome Analysis Toolkit (GATK) v4.1.7.0
HTSJDK Version: 2.21.2
Picard Version: 2.21.9
b) Exact GATK commands used
GATK BaseRecalibrator
-I input.bam
-R reference.fasta
--known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf
--known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
--known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf
--known-sites resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf
-O recal_data.table
c) The entire error log if applicable.
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3
I understand that the Reference is not matching to the dbSNP features
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, .....
The dbSNP files were downloaded from the Resource Bundle.Since re-mapping at this stage is not preferable I am feeling hopeless is there a way out of this?
Please help. Thank you very much
-
Hi.
The cornerstone of genomic data pipelines is that all the data both the external and the produced one MUST have the same reference dictionary, i.e. the contig names must be absolutely identical. Any genome build can use several contig naming schemes: chr1, 1 or accessions like NC_000001.11 and etc. There are three options for you:
1. Look into the reference genome fasta file you used for alignment and find out which chromosome naming scheme does it use. Personally, I prefer chrN scheme as it speaks for itself. In your case, it seems like the resource VCF files for BQSR have chrN naming scheme, so you have to rename the chrN contigs to make them NC_00000whatever-like. This can be done with `bcftools annotate --rename-chrs` option (http://samtools.github.io/bcftools/bcftools.html#annotate). You have to provide a TSV file with two columns where the first column is the accession you have in your data, and the second column is the target name for that accession.
2. Rename the accession names in the reference FASTA file. It can be done manually with `sed` in bash.
3. Since you have spent much cpu time for alignment I presume the previous option is not preferable, thus I suggest to rename the accessions in BAM files (https://www.biostars.org/p/13462/). Note that accessions must be renamed both in the BAM header and body.
Final note: always check the chromosome naming schemes in the data you use. It happens so that most of the external resources use chrN-like naming scheme, I suggest you use it as well in the future. In case you might find it helpful, look at the "Global Assembly Definition" section at https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/#/def_asm_Primary_Assembly for GRCh38. It has a mapping of NC-like accessions to conventional chromosomes. I guess you'll see the logic beyond. By the way, NC-like accession numbers for GRCh37 and GRCh38 differ with the last digit.
-
Hi danilovkiri
Thank you so much for your response and contribution to building the knowledge-base of this forum. That was a very detailed and good explanation. I will refer other users to this post when such a question comes up again!
GATK team thanks you for your contribution!
Please sign in to leave a comment.
2 comments