Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GATK concordance

Answered
0

11 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Priya,

    There is an error message in your program log starting with "java.lang.NullPointerException". We took a look at the stack trace and found that it is coming from a contig mismatch from your truth and eval VCFs. Specifically, the eval VCF has contigs that the truth VCF doesn't have.

    You can inspect the headers of the VCFs to determine the contig differences and how to proceed. After you take a look, please let me know if you have further questions.

    I also created a GATK ticket so that our developers can have a better error message for this case. Here is the link: https://github.com/broadinstitute/gatk/issues/7562

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Priyadarshini Thirunavukkarasu

    Hello

    Thank you. I figured that the contigs don't match and for the truth data set to operate over a defined set of regions in the evaluation sample, a interval list has to be specified. But, interval list is in a different format and the script stops running with the below error.

    A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary

    I have also pasted below the interval list which gives the above error


    @HD VN:1.5
    @SQ SN:NC_000001.11 LN:248956422
    @SQ SN:NT_187361.1 LN:175055
    @SQ SN:NT_187362.1 LN:32032
    @SQ SN:NT_187363.1 LN:127682
    @SQ SN:NT_187364.1 LN:66860
    @SQ SN:NT_187365.1 LN:40176
    @SQ SN:NT_187366.1 LN:42210
    @SQ SN:NT_187367.1 LN:176043
    @SQ SN:NT_187368.1 LN:40745

    Does anyone know how to format the interval list as shown in the below format:


    @HD VN:1.5
    @SQ SN:chr1 LN:248956422
    @SQ SN:chr2 LN:242193529
    @SQ SN:chr3 LN:198295559
    @SQ SN:chr4 LN:190214555
    @SQ SN:chr5 LN:181538259
    @SQ SN:chr6 LN:170805979
    @SQ SN:chr7 LN:159345973
    @SQ SN:chr8 LN:145138636
    @SQ SN:chr9 LN:138394717

    Thanks

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Priyadarshini Thirunavukkarasu,

    This is not a formatting issue with your interval list. The error message: "Contig NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary" is indicating that the first contig does not match the fasta file that you provided to the tool. This is another reference mismatch issue.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Priyadarshini Thirunavukkarasu

    Hello

    Sorry for the confusion.  I am trying to evaluate the filtered variants in filtered vcf with truth data set. The truth data vcf fies has chromosome number as chr. But, the filtered vcf has accession number in the vcf file to represent the chromosome number. The  "contig NC_000001.11" given in the error message is present in the filtered data set and it is not present in the truth data set (downloaded from GIAB website). Is there way to fix the issue, i.e. replace accession number or chromosome number in the vcf files. I found this issue  by creating intervals from truth data vcf files and filtered vcf data files.

    Thanks

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    The different contig names indicate that different references were used to map the reads and call variants. You'll need to find a truth set that uses the same reference version as your VCF.

    Please read this article for more information: https://gatk.broadinstitute.org/hc/en-us/articles/360035891131-Errors-about-input-files-having-missing-or-incompatible-contigs

    0
    Comment actions Permalink
  • Avatar
    Priyadarshini Thirunavukkarasu

    Hello

    I used GRCh38 as reference genome for mapping the exome data and the truth data set was from genome in bottle datasets and was also annotated with GRCh38. So, it is not clear, why there is an error and how to fix the error

    Thanks

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    You can check the contig lines in the header of your VCF: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format

    You'll want to make sure that these match if they are truly from the same reference version.

    0
    Comment actions Permalink
  • Avatar
    Priyadarshini Thirunavukkarasu

    Thank you. So, the header of the filtered vcf data file and truth data vcf file should match?. If not, there are from different reference version. I will check the header of the vcf files. Is there way to visualize the header of vcf file which is gzipped. The truth vcf data set downloaded from the genome in bottle database is gzipped

    Thanks

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    The headers do not need to match exactly, only the contig lines.

    Here's a post on biostars about viewing zipped VCF files: https://www.biostars.org/p/454748/

    0
    Comment actions Permalink
  • Avatar
    Priyadarshini Thirunavukkarasu

    Thank you. In the truth data set downloaded from database, the contig position is given as chromosome number as Chr 1, Chr 2, Chr 3 and so on. In the filtered data set, the contigs are given as accession number. For example, NC_000001.11 is chromosome 1. The other difference, there are number of contigs in chromosome 1 in the filtered data set but in the truth data set, there is only contig for each chromosome

    Example in the truth data set

    @SQ SN:chr1 LN:248956422
    @SQ SN:chr2 LN:242193529
    @SQ SN:chr3 LN:198295559
    @SQ SN:chr4 LN:190214555
    @SQ SN:chr5 LN:181538259
    @SQ SN:chr6 LN:170805979
    @SQ SN:chr7 LN:159345973
    @SQ SN:chr8 LN:145138636
    @SQ SN:chr9 LN:138394717

    In filtered data set

    @SQ SN:NC_000001.11 LN:248956422 (Chr1 as given in the above table, length (LN) is same as shown above)
    @SQ SN:NT_187361.1 LN:175055 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187362.1 LN:32032 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187363.1 LN:127682 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187364.1 LN:66860 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187365.1 LN:40176 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187366.1 LN:42210 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187367.1 LN:176043 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187368.1 LN:40745 (additional contig Chr 1, not seen in above table)
    @SQ SN:NT_187369.1 LN:41717 (additional contig Chr 1, not seen in above table)
    @SQ SN:NC_000002.12 LN:242193529 (Chromosome 2, please see the table above where the length (LN is same)

    These additional contigs in the filtered data set leads to error "incompatible contigs" or contigs not present in fasta sequence dictionary. I downloaded the reference genome GRCH38, patch 13 from NCBI assembly database to align the sample files. The truth data set was shown to be annotated from genome in bottle database (GRCh38). Not sure, how to evaluate the filtered data sets as there are additional contigs in the filtered data sets compared to the truth data sets. Please advice

    Thanks

    0
    Comment actions Permalink
  • 0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk