User error: input files reference and features have incompatible contigs
Can you please provide
a) GATK version used
b) Exact GATK commands used
c) The entire error log if applicable.
Hello,
I am very new to bioinformatics and I have a project to create a variant calling pipeline using Snakemake. I am finding it difficult to use the BaseRecalibrator tool.
This is my rule:
rule recalibrate:
    input:
        genome = "data/external/genome.fa",
        bam = expand("../../data/interim/mapped_reads/{sample}_marked_duplicates.bam",
                     sample=config["samples"]),
        known = "/data/external/dbsnp.vcf"
    output:
        "../../data/interim/mapped_reads/{sample}_recal.table"
    conda:
        "../../environment.yml"
    shell:
        "gatk BaseRecalibrator -R {input.genome} "
        "-I {input.bam} -O {output} --known-sites {input.known}"
I need to use the following resources for my data:
reads - ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L002_R1_001_trimmed.fastq.gz, ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L002_R2_001_trimmed.fastq.gz
For the known sites, I downloaded the dbsnp_138.b37.vcf.gz file from the resource bundle linked in the documentation. However, I keep getting the following error:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
I don't understand: can I not use my data and then recalibrate it using the GATK tool? I am not allowed to change the genome or the read data samples.
The whole log:
Using GATK jar ../evaluating_the_performance_of_variant_calling_pipelines/src/pipelines/.snakemake/conda/8660692a/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar../evaluating_the_performance_of_variant_calling_pipelines/src/pipelines/.snakemake/conda/8660692a/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar BaseRecalibrator -R ../evaluating_the_performance_of_variant_calling_pipelines/data/external/genome.fa -I ../../data/interim/mapped_reads/A_marked_duplicates.bam -O ../../data/interim/mapped_reads/B_recal.table --known-sites /h../evaluating_the_performance_of_variant_calling_pipelines/data/external/dbsnp.vcf
12:28:19.021 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:../evaluating_the_performance_of_variant_calling_pipelines/src/pipelines/.snakemake/conda/8660692a/share/gatk4-4.1.4.1-1/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 14, 2020 12:28:19 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
12:28:19.251 INFO BaseRecalibrator - ------------------------------------------------------------
12:28:19.251 INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.4.1
12:28:19.251 INFO BaseRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
12:28:19.251 INFO BaseRecalibrator - Executing as name@name-B360M-DS3H on Linux v4.15.0-76-generic amd64
12:28:19.251 INFO BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
12:28:19.251 INFO BaseRecalibrator - Start Date/Time: 14 February 2020 12:28:19 GMT
12:28:19.251 INFO BaseRecalibrator - ------------------------------------------------------------
12:28:19.251 INFO BaseRecalibrator - ------------------------------------------------------------
12:28:19.251 INFO BaseRecalibrator - HTSJDK Version: 2.21.0
12:28:19.251 INFO BaseRecalibrator - Picard Version: 2.21.2
12:28:19.252 INFO BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:28:19.252 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:28:19.252 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:28:19.252 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:28:19.252 INFO BaseRecalibrator - Deflater: IntelDeflater
12:28:19.252 INFO BaseRecalibrator - Inflater: IntelInflater
12:28:19.252 INFO BaseRecalibrator - GCS max retries/reopens: 20
12:28:19.252 INFO BaseRecalibrator - Requester pays: disabled
12:28:19.252 INFO BaseRecalibrator - Initializing engine
12:28:19.625 INFO FeatureManager - Using codec VCFCodec to read file file:/../evaluating_the_performance_of_variant_calling_pipelines/data/external/dbsnp.vcf
12:28:19.738 INFO BaseRecalibrator - Shutting down engine
[14 February 2020 12:28:19 GMT] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=641204224
***********************************************************************
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15,
-
Official comment
Hi Jyotsana Mehra
1. The fastest way to find what you are looking for is to look in our documentation first. We have generated an extensive list of documentation to help the community and most of your questions can be answered there. For example the resources you are looking for can be found in the resource bundle document here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
2. Another important thing to do is to search the forum to see if your question has already been answered.
Please read this article we created about how to find what you are looking for on our website: https://gatk.broadinstitute.org/hc/en-us/articles/360053424591
-
mons7re Your reference is GRCh38, but your dbSNP (the "features") is for the b37 reference. Assuming that your reads are aligned to the GRCh38 reference, you can fix the problem by using a GRCh38 (often this is called hg38) version of dbSNP.
The GRCh38 reference is the successor to b37. It differs from b37 mainly in terms of completeness -- fewer gaps in repetitive regions like telomeres and centromeres -- and also contains so-called "alt contigs", which... well, maybe that would be too much information for now.
By the way, most of the GATK developers did not come to the Broad Institute with any background in biology. We all remember getting tripped up on things like this when we started. It happens to everyone.
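If it helps to see the mismatch for yourself before swapping files, here is a minimal sketch (not a GATK tool) that compares the contig names in a FASTA index with those declared in a VCF header. It assumes the reference has a .fai index, whose first tab-separated column is the contig name, and that the VCF header carries ##contig lines. The function names and the two sample snippets are made up for illustration.

```python
import re


def fai_contigs(fai_text):
    """Contig names from the text of a FASTA index (.fai): first column of each line."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def vcf_contigs(header_text):
    """Contig IDs declared in '##contig=<ID=...>' lines of a VCF header."""
    return re.findall(r"^##contig=<ID=([^,>]+)", header_text, flags=re.MULTILINE)


def shared_contigs(reference, features):
    """Names present in both lists, in the order they appear in `features`."""
    reference_names = set(reference)
    return [name for name in features if name in reference_names]


# Illustrative inputs only: a "chr"-prefixed reference index and an
# unprefixed VCF header, mirroring the situation in this thread.
EXAMPLE_FAI = "chr1\t248956422\t6\t60\t61\nchr2\t242193529\t253105708\t60\t61\n"
EXAMPLE_VCF_HEADER = (
    "##fileformat=VCFv4.1\n"
    "##contig=<ID=1,length=249250621,assembly=b37>\n"
    "##contig=<ID=2,length=243199373,assembly=b37>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)

if __name__ == "__main__":
    common = shared_contigs(fai_contigs(EXAMPLE_FAI), vcf_contigs(EXAMPLE_VCF_HEADER))
    print("shared contigs:", common)
```

An empty result corresponds to the "No overlapping contigs found" message. The sketch only diagnoses the mismatch; it does not try to rename anything, because matching names alone would not make two different genome builds compatible.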
-
Where can I get the reads from?
I have all the desired labels for the reference contigs but my read contigs list is empty.
What am I missing?
-
Thanks. I looked into the issue, and there is something wrong with the reference genome I am using. Can you please provide a valid reference genome for the GATK pipeline?
-
Hi Jyotsana Mehra, please look at our resource bundle page, where you will find information about the resources we provide:
https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
-
Hi
I'm very new to bioinformatics and am trying to make a GVCF from BAM files. For this I used the hg19.fa file from the NCBI database. When I run the command, I get this error: A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.
reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]
reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
Should I change the contig names in the BAM file or change the reference genome file? Kindly reply.
-
Hi sarita 693,
This error indicates that your BAM file was aligned to a different reference than the one you are using. In this situation it's generally not safe to simply change the contig names -- instead, you should find out which reference your BAM file was aligned to, and use that reference instead. This information is sometimes present in the header of your BAM file, which you can view with "samtools view -H". If it's not there, you'll need to look into how your BAM files were created, and which reference was used during alignment.
Hope this helps,
David
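As a rough illustration of what to look for in that header, the sketch below pulls out the pieces that usually identify the reference: the @SQ sequence names (SN), any assembly or URI hints (AS, UR), and the @PG command lines (CL), which often include the path of the reference passed to the aligner. It works on header text you have already captured, for example the output of "samtools view -H" saved to a file. The function name and the sample header are hypothetical, and a real header may not contain the optional fields.

```python
def summarize_sam_header(header_text):
    """Collect reference-related fields from SAM/BAM header text."""
    summary = {"contigs": [], "assembly_hints": [], "command_lines": []}
    for line in header_text.splitlines():
        fields = line.split("\t")
        record_type = fields[0]
        # Header fields look like TAG:VALUE; split only on the first colon
        # because values (e.g. command lines or URIs) may contain colons.
        attrs = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if record_type == "@SQ":
            if "SN" in attrs:
                summary["contigs"].append(attrs["SN"])
            for key in ("AS", "UR"):  # optional assembly ID and sequence URI
                value = attrs.get(key)
                if value and value not in summary["assembly_hints"]:
                    summary["assembly_hints"].append(value)
        elif record_type == "@PG" and "CL" in attrs:
            summary["command_lines"].append(attrs["CL"])
    return summary


# Hypothetical header for illustration; real headers are much longer.
EXAMPLE_HEADER = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "@SQ\tSN:MT\tLN:16569",
    "@PG\tID:bwa\tPN:bwa\tCL:bwa mem reference.fa r1.fq r2.fq",
])

if __name__ == "__main__":
    for key, values in summarize_sam_header(EXAMPLE_HEADER).items():
        print(key, "=", values)
```

If the assembly hints are empty, as in this example, the aligner command line is usually the next best clue to which reference file was used.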