Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



mutect2 error: "Unknown file is malformed: Could not read sequence dictionary from given fasta file refereces.dict"

Answered

6 comments

  • Bhanu Gandham

    zdr j

     

    Your germline resource and panel of normals are based on hg38, while your reference is hs37d5. I believe that is where this error is coming from. Please try again with the hg38 reference provided here.

  • zdr j

    Hi 

    Thanks for your reply. I used the hg38 reference before, and then I faced this error: "Input files master sequence dictionary and reads have incompatible contigs: No overlapping contigs found". My input data is a slice of cancer WGS reads (BAM) from ICGC which, according to ICGC, were aligned with BWA-MEM. I downloaded it to my local machine using score-client. According to the file details in ICGC, the genome build is GRCh37 and the reference name is hs37d5. So I thought my problem might be solved by using a compatible reference (am I right about this?). I downloaded hs37d5 from two sources, and then the first error I mentioned in the post came up. But now I am not sure how to obtain the other files (PON and germline resource) in a form compatible with hs37d5. I would be very thankful for help with this. This is my first project with GATK, and I would really appreciate your guidance.

    Best



  • Bhanu Gandham

    zdr j

     

    With GATK you need to use the same reference for alignment and variant calling. 

    I used hg38 reference before, and then I faced this error: "Input files master sequence dictionary and reads have incompatible contigs: No overlapping contigs found". 

    From this error message, it looks like either the alignment was done with a different reference (which GATK does not accept), or the sequence dictionary is incorrect.

    Here are a few things to try:

    1. Create the sequence dictionary again and rerun. This verifies whether the sequence dictionary is accurate.
    2. Start from the raw reads and align them to hg38 yourself using bwa-mem. This might be your best bet.
    3. We do not provide PON and germline resource files for hs37d5, so unfortunately there isn't much I can help you with there. One option is to run Mutect2 in tumor-only mode, but that comes with its own caveats. You can read more about it here:
      https://gatk.broadinstitute.org/hc/en-us/articles/360051306691-Mutect2
      https://gatk.broadinstitute.org/hc/en-us/articles/360050722212-FAQ-for-Mutect2
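
A "No overlapping contigs found" error can often be checked by hand by comparing the contig names in the read header with those in the reference dictionary. hs37d5 names its chromosomes "1", "2", and so on, while hg38 uses "chr1", "chr2". The following is a minimal sketch, not an official GATK procedure. It assumes both inputs are plain text containing @SQ lines: a .dict file, or a BAM header dumped with `samtools view -H`. The function names and the file name `reads.header.txt` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare contig names between two SAM-style header files.
# Assumption: each input is plain text with @SQ lines, e.g. a .dict file or
# the output of `samtools view -H reads.bam > reads.header.txt`.

# Print the SN (sequence name) of every @SQ line, one per line, sorted.
contig_names() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }' "$1" | sort
}

# Print contig names present in both files.
# Empty output means the two files share no contig names.
shared_contigs() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    contig_names "$1" > "$a"
    contig_names "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, `shared_contigs reads.header.txt references_hg38_v0_Homo_sapiens_assembly38.dict` printing nothing would suggest the reads and the reference use different contig naming.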

  • zdr j

    Thank you so much for your reply and guidance. I will have to align the reads myself; I hope that works without errors :(((

    Best

  • zdr j

    Hello, 

    I used CrossMap to realign my reads to hg38, but now I face this problem again:

    "A USER ERROR has occurred: Unknown file is malformed: Could not read sequence dictionary from given fasta file references_hg38_v0_Homo_sapiens_assembly38.dict".

    this is the code I used:

    gatk Mutect2 \
         -R references_hg38_v0_Homo_sapiens_assembly38.fasta \
         -I test.hg38.sam \
         -I ctrl.sam \
         -normal UCR_1 \
         -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list \
         --sequence-dictionary references_hg38_v0_Homo_sapiens_assembly38.dict \
         --germline-resource somatic-hg38_af-only-gnomad.hg38.vcf.gz \
         --panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz \
         --output somatic.vcf.gz

     

    I would be thankful for any help.
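
Since the error names the .dict file itself, one quick thing to rule out is that the file is empty, truncated, or not a text dictionary at all. The sketch below is a rough heuristic under my own assumptions, not a diagnosis of this particular error and not how GATK validates the file. The function name `looks_like_dict` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: heuristic check that a file resembles a text sequence dictionary.
# Assumption: a usable .dict is non-empty, starts with a SAM header record
# (@HD or @SQ), and has at least one @SQ line carrying both SN: and LN: fields.

looks_like_dict() {
    # File must exist and be non-empty.
    [ -s "$1" ] || return 1
    # First line must be a SAM header record.
    head -n 1 "$1" | grep -Eq '^@(HD|SQ)' || return 1
    # At least one @SQ line must have a name and a numeric length.
    awk -F'\t' '$1 == "@SQ" {
        sn = 0; ln = 0
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) sn = 1
            if ($i ~ /^LN:[0-9]+$/) ln = 1
        }
        if (sn && ln) ok = 1
    }
    END { exit (ok ? 0 : 1) }' "$1"
}
```

It could be called as `looks_like_dict references_hg38_v0_Homo_sapiens_assembly38.dict && echo "looks fine"`. A failure here would point at the file rather than at the Mutect2 arguments.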

     

    Best

     

  • zdr j

    I downloaded "references_hg38_v0_Homo_sapiens_assembly38.fasta" from the Broad Institute resources on Google Cloud and then used:

    gatk CreateSequenceDictionary \
         -R references_hg38_v0_Homo_sapiens_assembly38.fasta

    to make the "dict" file, and also:

    samtools faidx references_hg38_v0_Homo_sapiens_assembly38.fasta

    to make the fai file.
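
Because the .dict and .fai files are both derived from the same FASTA, they should list the same contigs with the same lengths in the same order. The following is a small sketch for confirming that, assuming the standard layouts: @SQ lines with SN: and LN: fields in the .dict, and tab-separated name and length in the first two columns of the .fai. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that a .dict and a .fai describe the same contigs.

# Print "name<TAB>length" for each @SQ line of a .dict file, in file order.
dict_contigs() {
    awk -F'\t' '$1 == "@SQ" {
        sn = ""; ln = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) sn = substr($i, 4)
            if ($i ~ /^LN:/) ln = substr($i, 4)
        }
        print sn "\t" ln
    }' "$1"
}

# Print "name<TAB>length" from the first two columns of a .fai index.
fai_contigs() {
    cut -f 1,2 "$1"
}

# Succeed only if both files list the same contigs, lengths, and order.
dict_matches_fai() {
    [ "$(dict_contigs "$1")" = "$(fai_contigs "$2")" ]
}
```

A call such as `dict_matches_fai references_hg38_v0_Homo_sapiens_assembly38.dict references_hg38_v0_Homo_sapiens_assembly38.fasta.fai` returning non-zero would suggest one of the two files was built from a different or incomplete FASTA.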

    So I do not know what is wrong! :(((

    Is CrossMap not a suitable way to realign the reads to hg38 for this project?

    I really need help to figure this out.

    Best

     
