mutect2 error: "Unknown file is malformed: Could not read sequence dictionary from given fasta file refereces.dict"
AnsweredHellor
I am extremely in need of your help to figure out the stubborn error I face with GATK mutect2
I face an error saying:"Unknown file is malformed: Could not read sequence dictionary from given fasta file refereces.dict", running the pipeline explained fully below, I would be very grateful if any help to figure out what is wrong.
Best
1) I am using the latest version of GATK
2) I downloaded the human genome reference hs37d5 file both from ICGC and the following source:
https://console.cloud.google.com/storage/browser/genomics-public-data/references/hs37d5
3) unzipped the gz file using:
zcat references_hs37d5.fa.gz> references_hs37d5.fasta
4) created the dict and fai files using the following commands:
gatk CreateSequenceDictionary \
-R references_hs37d5.fasta
samtools faidx references_hs37d5.fasta
5) and then ran the mutect2 command as follows:
docker run -v ~/my_project:/gatk/my_data -it broadinstitute/gatk:latest
cd my_data
gatk Mutect2 \
-R references_hs37d5.fasta \
-I tumor.sam \
-I normal.sam \
-normal normal_test \
-L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list\
--sequence-dictionary references_hs37d5.dict\
--germline-resource somatic-hg38_af-only-gnomad.hg38.vcf.gz \
--panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz\
--ignore-itr-artifacts\
--output somatic.vcf.gz\
c) Entire error log:
gatk Mutect2 \
> -R references_hs37d5.fasta \
> -I tumor.sam \
> -I normal.sam \
> -normal normal_test \
> -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list\
> --sequence-dictionary references_hs37d5.dict\
> --germline-resource somatic-hg38_af-only-gnomad.hg38.vcf.gz \
> --panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz\
> --ignore-itr-artifacts\
> --output somatic.vcf.gz\
>
Using GATK jar /gatk/gatk-package-4.1.9.0-SNAPSHOT-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.9.0-SNAPSHOT-local.jar Mutect2 -R references_hs37d5.fasta -I tumor.sam -I normal.sam -normal normal_test -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list --sequence-dictionary references_hs37d5.dict --germline-resource somatic-hg38_af-only-gnomad.hg38.vcf.gz --panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz --ignore-itr-artifacts --output somatic.vcf.gz
09:42:26.344 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.9.0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 12, 2021 9:42:27 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
09:42:27.048 INFO Mutect2 - ------------------------------------------------------------
09:42:27.049 INFO Mutect2 - The Genome Analysis Toolkit (GATK) v4.1.9.0-SNAPSHOT
09:42:27.049 INFO Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
09:42:27.050 INFO Mutect2 - Executing as root@3a4240172fb0 on Linux v5.4.39-linuxkit amd64
09:42:27.050 INFO Mutect2 - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
09:42:27.050 INFO Mutect2 - Start Date/Time: February 12, 2021 9:42:26 AM GMT
09:42:27.050 INFO Mutect2 - ------------------------------------------------------------
09:42:27.051 INFO Mutect2 - ------------------------------------------------------------
09:42:27.052 INFO Mutect2 - HTSJDK Version: 2.23.0
09:42:27.052 INFO Mutect2 - Picard Version: 2.23.3
09:42:27.052 INFO Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:42:27.053 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:42:27.053 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:42:27.053 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:42:27.053 INFO Mutect2 - Deflater: IntelDeflater
09:42:27.054 INFO Mutect2 - Inflater: IntelInflater
09:42:27.054 INFO Mutect2 - GCS max retries/reopens: 20
09:42:27.054 INFO Mutect2 - Requester pays: disabled
09:42:27.054 INFO Mutect2 - Initializing engine
09:42:27.138 INFO Mutect2 - Shutting down engine
[February 12, 2021 9:42:27 AM GMT] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=169345024
***********************************************************************
A USER ERROR has occurred: Unknown file is malformed: Could not read sequence dictionary from given fasta file references_hs37d5.dict
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
-
Your germline resource and panel of normals is based on hg38 while your reference is hs37d5. I believe that is where this error is coming from. Please try again with the hg38 reference provided here.
-
Hi
Thanks for your reply. I used hg38 reference before, and then I faced this error: "Input files master sequence dictionary and reads have incompatible contigs: No overlapping contigs found". My input data is a sliced sequence of cancer WGS reads (BAM) from ICGC which according to ICGC is aligned read by BWA_MEM software. I download to local machine using score-client. Checking the file features in ICGC, the reference is genomic build GRCh37, and reference name is hs37d5. So I thought my problem may be solved using a compatible reference (am I right on this?), which I downloaded from two sources (hs37d5) and the first error I mentioned in the post came up. But yes, now I am not sure how to reach all other files (PON and germline) compatible with hs37d5. I would be very thankful for help on this, this is my first project with GATK and I would really appreciate your guidance to pass through.
Best
-
With GATK you need to use the same reference for alignment and variant calling.
I used hg38 reference before, and then I faced this error: "Input files master sequence dictionary and reads have incompatible contigs: No overlapping contigs found".
From this error message, it looks like either the alignment was done with a different reference(which GATK does not accept), or the sequence dictionary is incorrect.
Here are a few things to try:
- Create the sequence dictionary again and rerun - this is to verify if the sequence dictionary is accurate
- Start from raw reads and align your reads to hg38 yourself using bwa-mem. This might be your best bet.
- We do not provide PON and germline resource files for hs37d5. Unfortunately there isn't much I can help you with there. One option here is to run Mutect in tumor only mode, but that comes with its own caveats. You can read about it more here:
https://gatk.broadinstitute.org/hc/en-us/articles/360051306691-Mutect2
https://gatk.broadinstitute.org/hc/en-us/articles/360050722212-FAQ-for-Mutect2
-
ُThank you so much for your reply and guidance. I have to go for aligning the reads myself, hope that works straightforward with no errors :(((
Best
-
Hello,
I used crossmap to realign my reads to hg38. but now again I face this problem:
"A USER ERROR has occurred: Unknown file is malformed: Could not read sequence dictionary from given fasta file references_hg38_v0_Homo_sapiens_assembly38.dict".
this is the code I used:
gatk Mutect2 \
-R references_hg38_v0_Homo_sapiens_assembly38.fasta \
-I test.hg38.sam \
-I ctrl.sam \
-normal UCR_1 \
-L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list\
--sequence-dictionary references_hg38_v0_Homo_sapiens_assembly38.dict\
--germline-resource somatic-hg38_af-only-gnomad.hg38.vcf.gz \
--panel-of-normals somatic-hg38_1000g_pon.hg38.vcf.gz\
--output somatic.vcf.gz\
I would be thankful for any help.
Best
-
I downloaded "references_hg38_v0_Homo_sapiens_assembly38.fasta" from google cloud broad institute resources and then used:
gatk CreateSequenceDictionary \
-R references_hg38_v0_Homo_sapiens_assembly38.fasta to make the "dict" file and also:
samtools faidx references_hg38_v0_Homo_sapiens_assembly38.fasta
to make the fai file.
so I do not know what is wrong! :(((
Is it that crossmap is not a good way for this project to realign the reads to hg38?
I really need help to figure this out
Best
Please sign in to leave a comment.
6 comments