The GATK requires the reference sequence to be provided as a single file in FASTA format, with all contigs in the same file, validated according to the FASTA standard.
All standard IUPAC base codes are accepted, but positions carrying anything other than A, C, G or T (such as W, K, M, etc.) will be ignored, meaning that those positions in the genome will be skipped. Commonly used programs such as Picard and Samtools treat spaces in contig names differently, so we recommend not using spaces in contig names if you are making your own genome reference.
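If you are building your own reference, a quick scan of the header lines can catch spaces before they cause trouble. The following is a minimal Python sketch, not part of GATK; the function name `headers_with_whitespace` is hypothetical. Note that it flags any header containing whitespace, including headers that carry a free-text description after the contig name, so the output is a list of lines to review rather than a list of definite errors.

```python
def headers_with_whitespace(fasta_lines):
    """Return the text of every FASTA header line that contains whitespace.

    A header line starts with '>'. Everything after '>' is inspected,
    with leading and trailing whitespace removed first.
    """
    flagged = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # More than one whitespace-separated token means the header has a space or tab in it.
            if len(header.split()) > 1:
                flagged.append(header)
    return flagged


# Toy input for illustration only.
example = [">chr1", "ACGT", ">chr 2 unplaced", "NNNN"]
print(headers_with_whitespace(example))  # ['chr 2 unplaced']
```

To scan a real file, pass an open file handle instead of a list, for example `headers_with_whitespace(open("ref.fasta"))`.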
Most GATK tools additionally require that the main FASTA file be accompanied by a dictionary file ending in .dict and an index file ending in .fai, because these files allow efficient random access to the reference bases. GATK will look for these files based on their name, so it is important that they have the same basename as the FASTA file. If you do not have these files available for your organism's reference file, you can generate them very easily; instructions are included below.
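To check whether the companion files are in place, you can derive their expected names from the FASTA path. This is a minimal Python sketch, assuming an uncompressed FASTA file and the naming used in the examples on this page (ref.fasta goes with ref.dict and ref.fasta.fai); the function names are hypothetical and not part of GATK.

```python
from pathlib import Path


def companion_paths(fasta):
    """Return the expected (dictionary, index) paths for a FASTA file.

    Assumed naming, matching the examples on this page:
    ref.fasta -> ref.dict (extension replaced) and ref.fasta.fai (extension appended).
    """
    fasta = Path(fasta)
    dict_path = fasta.with_suffix(".dict")
    fai_path = fasta.with_name(fasta.name + ".fai")
    return dict_path, fai_path


def missing_companions(fasta):
    """Return the expected companion files that do not exist on disk."""
    return [path for path in companion_paths(fasta) if not path.exists()]


for path in companion_paths("ref.fasta"):
    print(path)
```

Calling `missing_companions("ref.fasta")` before launching an analysis returns an empty list when both files are present.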
If you are working with human data, we recommend you use one of the reference genome builds that we provide in our Resource Bundle or in Terra, our cloud-based analysis portal. We currently support GRCh38/hg38 and b37 (and to a lesser extent, hg19). For more information on the human genome reference builds, see this document.
Common problems with reference files
The most common reference-related issue people encounter is an incompatibility between some of the data and/or resources that were derived from (or mapped to) different reference builds. Read more about that problem and how to solve it in this solutions doc.
Some people have also reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.
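If you choose to clean the file up yourself, the sketch below shows one possible approach in Python. It assumes that the damage consists of Windows-style line endings, i.e. a carriage return byte before each newline; that is our assumption, so confirm it by inspecting your own file first. The function names are hypothetical and not part of GATK.

```python
import os
import tempfile


def count_carriage_returns(path, chunk_size=1 << 20):
    """Count carriage return bytes in a file, reading it in chunks."""
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += chunk.count(b"\r")
    return total


def strip_carriage_returns(src, dst, chunk_size=1 << 20):
    """Copy src to dst, dropping every carriage return byte."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            fout.write(chunk.replace(b"\r", b""))


# Demonstration on a throwaway file with Windows-style line endings.
with tempfile.TemporaryDirectory() as tmp:
    dirty = os.path.join(tmp, "dirty.fasta")
    clean = os.path.join(tmp, "clean.fasta")
    with open(dirty, "wb") as handle:
        handle.write(b">20\r\nACGT\r\nACGT\r\n")
    print(count_carriage_returns(dirty))  # 3
    strip_carriage_returns(dirty, clean)
    print(count_carriage_returns(clean))  # 0
```

Because removing bytes changes the length of every line, any existing .fai index must be regenerated from the cleaned file.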
Instructions for generating the dictionary and index files
Creating the FASTA sequence dictionary file
We use the CreateSequenceDictionary tool to create a .dict file from a FASTA file. Note that we only specify the input reference; the tool will name the output appropriately automatically.
gatk-launch CreateSequenceDictionary -R ref.fasta
This produces a SAM-style header file named ref.dict describing the contents of our FASTA file.
@HD VN:1.5
@SQ SN:20 LN:63025520 M5:0dec9660ec1efaaf33281c0d5ea2560f UR:file:/Users/vdauwera/Desktop/germline_mini/ref/ref.fasta
Here we are using a tiny reference file with a single contig, chromosome 20 from the human b37 reference genome, that we use for demo purposes. If we were running on the full human reference genome there would be many more contigs listed.
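Since the dictionary is a plain-text SAM-style header, it is easy to inspect programmatically. The following is a minimal Python sketch that splits one header line into its record type and its tags; it relies on SAM header fields being tab-delimited, and the function name `parse_header_line` is hypothetical. Each field is split on the first colon only, because values such as the UR path contain colons themselves.

```python
def parse_header_line(line):
    """Split a SAM-style header line into (record_type, {tag: value})."""
    record_type, *fields = line.rstrip("\n").split("\t")
    # Split on the first ':' only so that values like 'file:/path' stay intact.
    tags = dict(field.split(":", 1) for field in fields)
    return record_type, tags


# The @SQ line from the example dictionary, written with explicit tabs.
sq_line = (
    "@SQ\tSN:20\tLN:63025520\tM5:0dec9660ec1efaaf33281c0d5ea2560f"
    "\tUR:file:/Users/vdauwera/Desktop/germline_mini/ref/ref.fasta"
)
record_type, tags = parse_header_line(sq_line)
print(record_type, tags["SN"], int(tags["LN"]))  # @SQ 20 63025520
```

On a full human reference, applying this to every line that starts with @SQ gives one name and length per contig.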
Creating the FASTA index file
We use the faidx command in Samtools to prepare the FASTA index file. This file describes byte offsets in the FASTA file for each contig, allowing us to compute exactly where to find a particular reference base at specific genomic coordinates in the FASTA file.
samtools faidx ref.fasta
This produces a text file named ref.fasta.fai with one record per line for each of the FASTA contigs. Each record consists of the contig name, size, location, basesPerLine and bytesPerLine. The index file produced above looks like this:
20 63025520 4 60 61
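This shows that our FASTA file contains chromosome 20, which is 63025520 bases long. The remaining three fields are coordinates within the file: the sequence starts at byte 4, and each line holds 60 bases and takes up 61 bytes. You do not normally need to care about these values.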
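For the curious, the sketch below shows how an index record can be turned into the byte position of a given base. It is a minimal Python illustration of the arithmetic, assuming 1-based genomic positions and every sequence line except the last having the same length; the function names are hypothetical and not part of GATK or Samtools.

```python
def parse_fai_record(line):
    """Parse one .fai record into (contig, size, location, bases_per_line, bytes_per_line)."""
    name, size, location, bases_per_line, bytes_per_line = line.split()
    return name, int(size), int(location), int(bases_per_line), int(bytes_per_line)


def byte_offset(position, location, bases_per_line, bytes_per_line):
    """Return the byte offset in the FASTA file of a 1-based position on a contig."""
    # Number of complete sequence lines before the base, and its index within its own line.
    full_lines, within_line = divmod(position - 1, bases_per_line)
    return location + full_lines * bytes_per_line + within_line


contig, size, location, bases_per_line, bytes_per_line = parse_fai_record("20 63025520 4 60 61")
print(byte_offset(1, location, bases_per_line, bytes_per_line))   # 4
print(byte_offset(60, location, bases_per_line, bytes_per_line))  # 63
print(byte_offset(61, location, bases_per_line, bytes_per_line))  # 65
```

Base 61 is the first base of the second line, so its offset skips the 60 bases of the first line plus that line's newline byte. This is why bytesPerLine (61) is one larger than basesPerLine (60) in our example.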
4 comments
Dear GATK group,
I was trying to build a sequence dictionary file for the grch38 assembly downloaded from NCBI.
I first tried the gatk-launch CreateSequenceDictionary -R ref.fasta command, where I was told "gatk-launch" can't be found.
Instinctively I tried without the -launch. This time the CreateSequenceDictionary tool could be loaded, but I encountered an error that says,
Illegal argument value: Positional arguments were provided ',/home/field/shared/Genome_Refs/GCF_000001405.26_GRCh38_genomic.fna}' but no positional argument is defined for this tool.
I searched but failed to find a solution.
Thanks.
I have later on figured out that the dictionary can also be generated with the Picard-tools with the command
As mentioned by @Field -Ye Tian, "gatk-launch" can't be found with a normal version of GATK (in my case GATK 3.8). It is also not mentioned in the '--help' stdout or the GATK user's manual. Luckily, we still have the option to use CreateSequenceDictionary in Picard.
Thus, I'm wondering whether the "gatk-launch" module is an independent module from the GATK jar file, or it has already been removed from the GATK main program?
The page in this error message does not exist
"""
A USER ERROR has occurred: Fasta index file ...
Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.
"""