These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.
The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference. For more information about that see the Glossary entry on reference genomes.
Contents
- BAM file contigs not matching the reference
- VCF file contigs not matching the reference
BAM file contigs not matching the reference
A very common case we see looks like this:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
##### ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:
Input files reads and reference have incompatible contigs
It further tells us that the contig length doesn't match for the chrM contig:
Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.
We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.
Note that the error message also lists the content of the sequence dictionaries that it found for each file. We see that some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it were the opposite, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from these extra contigs and would error out (even if we try restricting analysis using -L) with something like this:
#### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
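The rule described above (extra contigs in the reference are fine; contigs that are missing from the reference or have a different length are not) can be checked before running GATK. The following is a minimal Python sketch of that comparison, not GATK's own validation code. The function names are hypothetical, and the header snippets are illustrative toy data that reuse the chrM lengths from the error above.

```python
def parse_sequence_dictionary(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM-style header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value, e.g. SN:chrM.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def compare_to_reference(input_contigs, reference_contigs):
    """Find input contigs that are absent from the reference or differ in length.

    Contigs present only in the reference are ignored, since the article
    says those are not a problem.
    """
    missing = [name for name in input_contigs if name not in reference_contigs]
    length_mismatches = {
        name: (length, reference_contigs[name])
        for name, length in input_contigs.items()
        if name in reference_contigs and reference_contigs[name] != length
    }
    return missing, length_mismatches


# Illustrative toy headers (not taken from real files).
bam_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)
reference_dict = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chrM\tLN:16571\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr1_gl000191_random\tLN:106433\n"
)

missing, length_mismatches = compare_to_reference(
    parse_sequence_dictionary(bam_header),
    parse_sequence_dictionary(reference_dict),
)
print("Missing from reference:", missing)
print("Length mismatches (input, reference):", length_mismatches)
```

In practice the BAM header text can be obtained with samtools view -H, and the reference's .dict file uses the same @SQ line format, so both can be fed to the same parser.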
Solution
If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.
If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
Special case of b37 vs. hg19
The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?
Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig is a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus etc) don't have direct equivalents.
So only try that if you know what you're doing. YMMV.
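To make the caveats concrete, here is a minimal Python sketch of what a careful rename could look like for a b37 VCF being converted to hg19-style names. It is an illustration under stated assumptions, not a supported procedure: the function name and the mapping table are hypothetical, only the canonical chromosomes are mapped, and everything else (including the mitochondrial contig, whose length differs between the builds) is dropped rather than renamed.

```python
import re

# Canonical chromosomes only: 1-22, X and Y. MT is deliberately left out
# because the mitochondrial contig has a different length in the two builds.
B37_TO_HG19 = {str(n): f"chr{n}" for n in range(1, 23)}
B37_TO_HG19.update({"X": "chrX", "Y": "chrY"})

CONTIG_HEADER = re.compile(r"^##contig=<ID=([^,>]+)")


def rename_vcf_line(line):
    """Return the line with its contig renamed, or None if it should be dropped."""
    if line.startswith("##contig="):
        match = CONTIG_HEADER.match(line)
        if match is None:
            return line  # unrecognized layout: leave untouched
        new_name = B37_TO_HG19.get(match.group(1))
        if new_name is None:
            return None  # contig has no direct equivalent: drop the header line
        return line[:match.start(1)] + new_name + line[match.end(1):]
    if line.startswith("#"):
        return line  # other header lines pass through unchanged
    chrom, tab, rest = line.partition("\t")
    new_name = B37_TO_HG19.get(chrom)
    if new_name is None:
        return None  # record on a non-canonical contig: drop it
    return new_name + tab + rest


example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=1,length=249250621>",
    "##contig=<ID=MT,length=16569>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t.\tPASS\t.",
    "MT\t73\t.\tA\tG\t.\tPASS\t.",
]
renamed = [out for out in map(rename_vcf_line, example) if out is not None]
print("\n".join(renamed))
```

This sketch only touches the contig header lines and the CHROM column. It does not handle contig names embedded elsewhere (for example in breakend ALT alleles or INFO fields), and any index for the file would need to be regenerated afterwards. BAM files are more involved, because contig names also appear in the header and in the mate fields of each read.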
VCF file contigs not matching the reference
ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chrM / 16569
ERROR contig reference = chrM / 16571.
Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:
Input files known and reference have incompatible contigs
In this case the error was output by a tool that takes a VCF file of known variants provided through the known argument, so this makes sense and tells us which file is at fault. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.
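For anyone triaging many logs, the same reading of the message can be sketched mechanically in Python. This is only an illustration based on the two example messages in this article: the function name is hypothetical, and the regular expressions assume the wording shown here, which may differ between GATK versions and tools.

```python
import re

# Patterns modeled on the example messages in this article only.
INPUTS = re.compile(r"Input files (\S+) and (\S+) have incompatible contigs")
CONTIG = re.compile(r"contig (\S+) = (\S+) / (\d+)")


def summarize_contig_error(message):
    """Return (pair of mismatched inputs or None, {input_label: (contig, length)})."""
    inputs = INPUTS.search(message)
    lengths = {
        label: (name, int(length))
        for label, name, length in CONTIG.findall(message)
    }
    return (inputs.groups() if inputs else None), lengths


message = (
    "ERROR MESSAGE: Input files known and reference have incompatible contigs: "
    "Found contigs with the same name but different lengths: "
    "ERROR contig known = chrM / 16569 ERROR contig reference = chrM / 16571."
)
inputs, lengths = summarize_contig_error(message)
print(inputs)
print(lengths)
```

The first element names the two inputs being compared (here the known file and the reference), and the second gives the contig and length reported for each.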
Solution
If you can, find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we make sets of suitable resources available for the supported reference builds. If you're working on your own installation of GATK, you can get these from the Resource Bundle. If you're using GATK on Terra, our cloud-based analysis platform, the featured GATK workspaces are preloaded with the appropriate resources.
If that's not an option, then you'll have to "liftover" -- specifically, liftover the mismatching VCF to the reference you need to work with. The best tool for liftover is Picard's LiftoverVcf. We provide several chain files to liftover between the major human reference builds, also in our resource bundle in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.
GATK used to include some liftover utilities but we no longer support them.
8 comments
Dear GATK group,
I have encountered a similar (if not the same) problem running the Mutect2 program.
It says "A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1,.........(where I omitted many more items)
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random,........"
I figured that the mismatch is between the VCF files (1000g_pon.hg38.vcf.gz and somatic-hg38_af-only-gnomad.hg38.vcf from https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38/)
and the reference file (the unpatched GRCh38 assembly from NCBI, https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/).
If I run the program without the pon and germline source files, the program works smoothly.
I wonder if the inconsistency is caused by the outdated GRCh38 assembly and whether it can be solved by using the up-to-date GRCh38 patch 13.
Many thanks.
As a quick update, aligning to the latest GRCh38 patch 13 didn't solve the above-mentioned problem.
The solution stated above completely failed.
I generate the BAM file from a certain hg38 reference sequence using bwa.
Then I ran Mutect2 on the generated BAM and the same hg38 reference.
With the same source of hg38 reference, how would there be difference in naming of contigs?
How come I can't use the software smoothly?
I have tried using different reference sequences (UCSC vs RefSeq) and different sources of germline resources, and either I get
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
Or
A USER ERROR has occurred: An index is required but was not found for file /XXX/XXX/XXXX.vcf.gz. Support for unindexed block-compressed files has been temporarily disabled. Try running IndexFeatureFile on the input.
How come such an extensively developed and maintained software has such a bug that I can't even run a simple Mutect2 program as an initial small test?
Even when I only provided the input BAM files and the reference genome, without the germline resources, the Mutect2 program couldn't even produce a VCF file.
Got the same problem, tried different references, did not solve the issue.
frustrated at GATK
Can anyone recommend other software to call somatic mutations (small indels and point mutations) that runs without so many bugs and errors?
my reads are aligned sliced sequences from ICGC (aligned to GRCH37)
Please, any help will be appreciated
Hello!
Is there any way to test the compatibility of contigs between a BAM file and a reference genome FASTA file before the analysis, in case we are not sure whether the BAM and the reference genome match?
Dear all,
I got some bam files from a collaborator and I would like to use the GATK workflow from the BaseRecalibrator tool.
Their reference sequence included some control sequences, such as PhiX, that are absent from the reference sequence I will use later. Hence, the two differ in just 2 "@SQ" lines:
Bam file header from gathered bam files:
What I am actually interested in:
To avoid mapping the FASTQ sequences again, is there a way to edit the BAM header to obtain a valid BAM file without the reference sequences hsd37d5 and phiX174?
Thank you very much in advance.