Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GRCh37 hg19 b37 humanG1Kv37 - Human Reference Discrepancies

3 comments

  • W Maier

    It seems there are at least two errors in the comparison table on this page:

    1) The MD5 sum of GRCh37 Y is NOT identical to that of hg19 chrY. Instead it's 1fa3474750af0948bdf97d5a0ee52e51, i.e., identical to the one you list for HumanG1Kv37 and b37.

    The difference between the two versions is that the GRCh37 version has many more N-masked bases at both ends of the Y chromosome than hg19. At every position that is unmasked in both, the sequences are identical.

    2) The names of all primary assembled chromosomes in GRCh37 (including the sex chromosomes and the mitochondrial genome) have NO chr prefix, i.e., those names are identical to those used in HumanG1Kv37 and b37.

    These observations are based on GRCh37 downloaded from ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/. Together, they make GRCh37 more similar to HumanG1Kv37 and b37 than the current table suggests.

    ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.toplevel.fa.gz
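
The masking claim above can be checked with a position-wise comparison. The following is only a minimal Python sketch, not the procedure used in the comment: the function names and the toy sequences are hypothetical, and it assumes both versions of the chromosome have already been loaded as strings of equal length.

```python
import hashlib


def count_masked(seq: str) -> int:
    """Count hard-masked (N) bases, ignoring case."""
    return seq.upper().count("N")


def identical_where_unmasked(seq_a: str, seq_b: str) -> bool:
    """True if the sequences agree at every position where neither has an N."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return all(
        a == b or a == "N" or b == "N"
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


def sequence_md5(seq: str) -> str:
    """MD5 of the uppercased sequence, as one hex string."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


# Toy stand-ins (hypothetical data): one version masks more bases at both ends.
more_masked = "NNNNACGTACGTNNNN"
less_masked = "NNGTACGTACGTACNN"

print(count_masked(more_masked), count_masked(less_masked))
print(identical_where_unmasked(more_masked, less_masked))
print(sequence_md5(more_masked) == sequence_md5(less_masked))
```

Two sequences can pass the position-wise check and still have different MD5 sums, which is the situation described for GRCh37 Y and hg19 chrY.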

  • Maximilianh

    One of the genomes above, "GRCh37", is at patch level 13, unlike the other three genomes, where you used the original release. This explains most of the differences that you found. Can you tell us the exact URLs from which you downloaded the four files? Also, what operations did you run on these files? I imagine that you converted them to all uppercase to remove the soft masking.

    The only Google hit for the GRCh37 filename is ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz. Its MD5 sum matches yours, so its content is at least identical to that of your file.

    For this file, we just ran a diff against UCSC's chr3 and chrY and found only a single difference: the sequence identifier, which is "chr3" / "chrY" at UCSC and "CHR3" / "CHRY" in the Gencode file. Is it possible that there is a bug in the script that created this table?
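
A comparison that ignores the identifier, the line wrapping, and the soft masking can be sketched in Python as below. This is an illustration under stated assumptions, not the diff that was actually run: `per_sequence_md5` is a hypothetical helper, and the two record lists are made-up toy data standing in for the UCSC and Gencode files.

```python
import hashlib
from typing import Dict, Iterable


def per_sequence_md5(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA record name to the MD5 of its uppercased sequence.

    Header text after the first whitespace and all line breaks are ignored,
    so only the bases themselves contribute to the digest.
    """
    digests: Dict[str, str] = {}
    name = None
    hasher = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                digests[name] = hasher.hexdigest()
            name = line[1:].split()[0]
            hasher = hashlib.md5()
        elif hasher is not None:
            hasher.update(line.upper().encode("ascii"))
    if name is not None:
        digests[name] = hasher.hexdigest()
    return digests


# Toy records (hypothetical): same bases, different name, case, and wrapping.
ucsc_like = [">chr3", "acgtNN", "ACGT"]
gencode_like = [">CHR3 some description", "ACGTNNAC", "GT"]

print(per_sequence_md5(ucsc_like)["chr3"] == per_sequence_md5(gencode_like)["CHR3"])
```

Passing an open file handle instead of a list works the same way, since the function only iterates over lines.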

    This reduces the actual differences to only chrM, which is documented by UCSC: hg19 was released before the "official" chrM was chosen. UCSC will most likely add a chrMT sequence for compatibility with the other genome versions.

    As for Ensembl, depending on the exact URL, the Ensembl files are not the same as the GRC sequence. Ensembl pads the alternates with Ns to create full coordinate-compatible alternate chromosomes.

     

     

  • Maximilianh

    Sorry, I just saw that you did provide the URLs! Never mind my first question.

    Using these URLs, I cannot reproduce your full-file md5sums. The md5 of GCF_000001405.25_GRCh37.p13_genomic.fna at NCBI does not match the one in this post and the md5 of hg19.fa is also different.

    wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz -O - | zcat | md5sum

    530d89d3ef07fdb2a9b3c701fb4ca486

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz  -O - | zcat | md5sum

    fbd575486dfa3b94d7e9bab87afa1c90

    I tried md5'ing the gzipped files, but that didn't match either.
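
For anyone repeating this check without `zcat` and `md5sum`, both digests can be computed from a local copy of the file. The Python sketch below is a rough equivalent, assuming the file has already been downloaded; the function names are hypothetical and the path is supplied by the caller.

```python
import gzip
import hashlib
from typing import BinaryIO, Tuple


def md5_of_stream(fileobj: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large genomes never sit fully in memory."""
    hasher = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def md5_compressed_and_decompressed(path: str) -> Tuple[str, str]:
    """Return (MD5 of the .gz bytes, MD5 of the decompressed content)."""
    with open(path, "rb") as raw:
        compressed = md5_of_stream(raw)
    with gzip.open(path, "rb") as plain:
        decompressed = md5_of_stream(plain)
    return compressed, decompressed
```

The second value corresponds to what the `zcat | md5sum` pipelines above print. The first depends on the compression settings and gzip header as well as the content, so it is a weaker basis for comparing references.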

