Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GRCh37 hg19 b37 humanG1Kv37 - Human Reference Discrepancies Follow

9 comments

  • Avatar
    W Maier

    It seems there are at least two errors in the comparison table on this page:

    1) The MD5 sum of GRCh37 Y is NOT identical to that of hg19 chrY. Instead it's 1fa3474750af0948bdf97d5a0ee52e51, i.e., identical to the one you list for HumanG1Kv37 and b37.

    The difference between the two versions is that the GRCh37 version has a lot more N-masked bases at both ends of the Y chromosome than hg19. The non-masked intersect is sequence-identical.

    2) The names of all primary assembled chromosomes in GRCh37 (including the sex chromosomes and the mitochondrial genome) have NO chr prefix, i.e., those names are identical to those used in HumanG1Kv37 and b37.

    These observations are based on GRCh37 downloaded from ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/ and together make GRCh37 more similar to HumanG1Kv37 and b37 than suggested by the current table.

    ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.toplevel.fa.gz

    0
    Comment actions Permalink
  • Avatar
    Maximilianh

    One of the genomes above, "GRCh37"  is at patch level 13, unlike the other three genomes, where you used the original release. This explains most of the differences that you found. Can you tell us where you downloaded the four files, the exact URLs ? Also, what operations did you run on these files, I imagine that you converted them to all uppercase, to remove the soft masking?

    The only Google hit for the GRCh37 filename is ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz. Its MD5 sum matches yours, so this is at least identical to the file. 

    For this file, we just ran a diff against UCSC's chr3 and chrY and can only find a single difference, the sequence identifier, which is "chr3" / "chrY" at UCSC and "CHR3" / "CHRY" for the Gencode file. Is it possible that there is a bug in the script that created this table?

    This reduces the actual differences to only chrM, which is documented by UCSC (hg19 was released before the "official" chrM was chosen. UCSC will most likely add a chrMT sequence for compatibility with the other genome versions.)

    As for Ensembl, depending on the exact URL, the Ensembl files are not the same as the GRC sequence. Ensembl pads the alternates with Ns to create full coordinate-compatible alternate chromosomes.

     

     

    0
    Comment actions Permalink
  • Avatar
    Maximilianh

    Sorry, I just saw that you did provide the URLs! Never mind my first question.

    Using these URLs, I cannot reproduce your full-file md5sums. The md5 of GCF_000001405.25_GRCh37.p13_genomic.fna at NCBI does not match the one in this post and the md5 of hg19.fa is also different.

    wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz -O - | zcat | md5sum

    530d89d3ef07fdb2a9b3c701fb4ca486

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz  -O - | zcat | md5sum

    fbd575486dfa3b94d7e9bab87afa1c90

    I tried md5'ing the gzipped files, but that didn't match either.

    2
    Comment actions Permalink
  • Avatar
    Jonn Smith

    @W_Maier - Responses follow, but briefly - you seem to be using a different file for your comparisons than the ones I used above.

    1)  The comparisons are correct as posted, and are derived directly from the sequence dictionaries for each fasta file.

    The sequence dictionary for the GRCh37 file I used (as detailed above) contains the following sequence information for chrY:

    SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623 UR:file:/GRCh37.p13.genome.fasta

    This is not identical to the MD5Sum you specified, however it does correctly correspond to that in the table.

     

    Plenty of people have told me that the difference is in masking bases, but no one has proved it.  It has been very low on my to-do list since writing this original post, but I haven't had time to do it.  It is not that I don't believe you, rather it is that I want to know exactly what the differences are.

     

    2)

    The sequence dictionary for the GRCh37 file used indicates this is not the case:

    @HD VN:1.5
    @SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr9 LN:141213431 M5:3e273117f15e0a400f01055d9f393768 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr10 LN:135534747 M5:988c28e000e84c26d552359af1ea2e1d UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr11 LN:135006516 M5:98c59049a2df285c76ffb1c6db8f8b96 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr12 LN:133851895 M5:51851ac0e1a115847ad36449b0015864 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr13 LN:115169878 M5:283f8d7892baa81b510a015719ca7b0b UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr14 LN:107349540 M5:98f3cae32b2a2e9524bc19813927542e UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr15 LN:102531392 M5:e5645a794a8238215b2cd77acb95a078 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr16 LN:90354753 M5:fc9b1a7b42b97a864f56b348b06095e6 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr17 LN:81195210 M5:351f64d4f4f9ddd45b35336ad97aa6de UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr18 LN:78077248 M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr19 LN:59128983 M5:1aacd71f30db8e561810913e0b72636d UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr20 LN:63025520 M5:0dec9660ec1efaaf33281c0d5ea2560f UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr21 LN:48129895 M5:2979a6085bfe28e3ad6f552f361ed74d UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chr22 LN:51304566 M5:a718acaa6135fdca8357d5bfe94211dd UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623 UR:file:/GRCh37.p13.genome.fasta
    @SQ SN:chrM LN:16569 M5:c68f52674c9fb33aef52dcf399755519 UR:file:/GRCh37.p13.genome.fasta
    0
    Comment actions Permalink
  • Avatar
    Jonn Smith

    As an aside, the discussion here is exactly why I wanted to document these differences - everyone seems to have their favorite HG19 / GRCh37 assembly and they're not always 100% compatible.  This is particularly true for sequence dictionary based checks and has led to a lot of problems in practice.

    0
    Comment actions Permalink
  • Avatar
    Maximilian Haeussler

    Jonn, yes, of course they are not the same in every way, but when we checked after your post, the primary chromosome sequences were identical, contrary to what you found for chr3 and chrY.

    Can you tell us how got the MD5 of 1e86411d73e6f00a10590f976be01623 for chrY and also the MD5 of chr3 ? We were unable to recreate these.

    Sorry, I don't know what you mean with "sequence dictionary", to me, sequences come as .fasta files.

    Another question: how did you obtain the MD5s of the input files that you report? I copied my Unix commands above where I try to check if we are looking at the same files and got different MD5s for the same files than you got, notably the hg19.fa.gz file from UCSC and the NCBI file ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz%C2%A0%C2%A0-O How did you obtain your MD5s for the full files?

    0
    Comment actions Permalink
  • Avatar
    Jonn Smith

    Maximilian Haeussler Ah.  I just looked and it looks like the link I put for GRCh37.p13 is not correct for the file I have.  I will have to look into where the copy I have came from.

    As for the sequence dictionary - A sequence dictionary is a file that indicates all the sequences that are contained in a FASTA file.  For tools in the GATK, we usually require a sequence dictionary and a FASTA index file to work with a reference.  This is so we can randomly access the FASTA file and provide interval-based operations.  The sequence dictionaries I refer to were created with the `CreateSequenceDictionaryTool` by the following:

    gatk CreateSequenceDictionary -R REFERENCE.FASTA

    This tool looks at each sequence name in the file, then takes an md5sum of the sequence itself and records this information in plain text forma (I posted a large portion of a sequence dictionary above).

    I looked at your comand-line invocations and that's basically what I did (I just unzipped the files to disk first).  If the files aren't the same that would explain the discrepancies.  I've made a note to compare the reference I've erroneously linked to with the others noted here.  I'm checking the ucsc.hg19 file now as well.

    There seems to be a problem with provenance for certain copies of the reference that we're keeping around, so I'll try to track down where the discrepancies arose.

    Not sure when I will get around to cleaning it up, but probably not for a week or two.

     

    0
    Comment actions Permalink
  • Avatar
    WangZiwei

    I download hg19 reference genomes respectively from GATK resource bundle and UCSC http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

    the md5sum of hg19 from GATK resource bundle: a244d8a32473650b25c6e8e1654387d6 (same as the article posted)
    the md5sum of hg19 from UCSC website : 530d89d3ef07fdb2a9b3c701fb4ca486 (same as Maximilianh posted)

    I wonder if you have found out where the discrepancies arose.

    0
    Comment actions Permalink
  • Avatar
    Jonn Smith

    It's still something I want to do, but I haven't been able to follow-up yet.  I'll update this post as soon as I do.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk