This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files, so it will fail with an error in this case.
So what do you do?
For BAM files
You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:
java -jar picard.jar ReorderSam \ I=original.bam \ O=reordered.bam \ R=reference.fasta \ CREATE_INDEX=TRUE
Where reference.fasta
is your genome reference, which must be accompanied by a valid *.dict
dictionary file. The CREATE_INDEX
argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).
Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!
For VCF files
You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:
java -jar picard.jar SortVcf \ I=original.vcf \ O=sorted.vcf \ SEQUENCE_DICTIONARY=reference.dict
Where reference.dict
is the sequence dictionary of your genome reference.
Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.
1 comment
Hello GATK,
I realised I was having this issue with the CNV caller tools... but only after the most expensive step that took 10 days on a 256-threaded machine. I don't much want to repeat that if not absolutely necessary. So my question is: Can I somehow edit the order after the fact? The edit would presumably be done on the input files to PostProcessGermlineCalls... I have HDFviewer but am not sure how to make edits with it.
PS. When I say I have this issue, I'm referring to these lines:
However, the output VCF file itself looks just fine - the chromosome order may be wack, but there are calls and things in it, as expected. So perhaps the problem is just that ProgressMeter fails when the chromosome order doesn't match the reference fasta (dict) file?
Please sign in to leave a comment.