Error using output of ScatterIntervalsByNs by SplitIntervals
Hi there,
when I build an interval file with ScatterIntervalsByNs on the hg38 reference and feed the output to SplitIntervals, it fails with: "A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "@HD VN:1.6 SO:coordinate"is not valid for this input." I want to create the intervals file for use with BaseRecalibrator, as recommended by the Intel GATK4 performance guide.
Can you help?
a) GATK version used
The Genome Analysis Toolkit (GATK) v4.1.6.0
HTSJDK Version: 2.21.2
Picard Version: 2.21.9
b) Exact GATK commands used
ref_fasta="/home/zyto/unger/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta"
interval_list="/home/zyto/unger/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.intervals.list"
interval_list_folder="/home/zyto/unger/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.intervals.list.folder"
singularity run /home/zyto/unger/gatk_latest.sif gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=4" \
ScatterIntervalsByNs \
--OUTPUT $interval_list \
--OUTPUT_TYPE=N \
--REFERENCE $ref_fasta
singularity run /home/zyto/unger/gatk_latest.sif gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=4" \
SplitIntervals \
--reference $ref_fasta \
--intervals $interval_list \
--scatter-count 4 \
--output $interval_list_folder
c) The entire error log if applicable.
[unger@frontser GATK_Exome_Lisa_HD]$ singularity run /home/zyto/unger/gatk_latest.sif gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=4" \
> SplitIntervals \
> --reference $ref_fasta \
> --intervals $interval_list \
> --scatter-count 4 \
> --output $interval_list_folder
Using GATK jar /gatk/gatk-package-4.1.6.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=4 -jar /gatk/gatk-package-4.1.6.0-local.jar SplitIntervals --reference /home/zyto/unger/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta --intervals /home/zyto/unger/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.intervals.list --scatter-count 4 --output /home/zyto/unger/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.intervals.list.folder
20:33:59.963 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jul 07, 2020 8:34:00 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
20:34:00.180 INFO SplitIntervals - ------------------------------------------------------------
20:34:00.180 INFO SplitIntervals - The Genome Analysis Toolkit (GATK) v4.1.6.0
20:34:00.180 INFO SplitIntervals - For support and documentation go to https://software.broadinstitute.org/gatk/
20:34:00.180 INFO SplitIntervals - Executing as unger@frontser on Linux v3.10.0-957.el7.x86_64 amd64
20:34:00.180 INFO SplitIntervals - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03
20:34:00.181 INFO SplitIntervals - Start Date/Time: July 7, 2020 8:33:59 PM UTC
20:34:00.181 INFO SplitIntervals - ------------------------------------------------------------
20:34:00.181 INFO SplitIntervals - ------------------------------------------------------------
20:34:00.181 INFO SplitIntervals - HTSJDK Version: 2.21.2
20:34:00.181 INFO SplitIntervals - Picard Version: 2.21.9
20:34:00.181 INFO SplitIntervals - HTSJDK Defaults.COMPRESSION_LEVEL : 2
20:34:00.181 INFO SplitIntervals - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
20:34:00.182 INFO SplitIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
20:34:00.182 INFO SplitIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
20:34:00.182 INFO SplitIntervals - Deflater: IntelDeflater
20:34:00.182 INFO SplitIntervals - Inflater: IntelInflater
20:34:00.182 INFO SplitIntervals - GCS max retries/reopens: 20
20:34:00.182 INFO SplitIntervals - Requester pays: disabled
20:34:00.182 INFO SplitIntervals - Initializing engine
20:34:00.512 INFO SplitIntervals - Shutting down engine
[July 7, 2020 8:34:00 PM UTC] org.broadinstitute.hellbender.tools.walkers.SplitIntervals done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2174746624
***********************************************************************
A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "@HD VN:1.6 SO:coordinate"is not valid for this input.
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
-
Hi Kristian Unger, it seems that you are supplying an interval, "@HD VN:1.6 SO:coordinate", that cannot be used. Please check all your intervals and make sure they are valid.
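One quick way to check is to look for header lines in the interval file. The snippet below is only a sketch: show_interval_header is a hypothetical helper name, not a GATK command, and it assumes that any line starting with "@" belongs to a SAM-style header rather than to the intervals themselves.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): print the first few header lines
# of an interval file together with their line numbers.
# Assumption: lines starting with "@" are SAM-style header lines.
show_interval_header() {
    local file="$1"
    local max_lines="${2:-5}"   # how many header lines to show (default 5)
    grep -n '^@' "$file" | head -n "$max_lines"
}
```

Calling show_interval_header "$interval_list" prints nothing for a plain one-interval-per-line file, and prints lines such as "1:@HD ..." if the file carries a header.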
-
Hi, I'm wondering whether this issue was ever solved? I recently ran into a similar problem with the CollectReadCounts tool and can't solve it. I used BedToIntervalList to convert hg38.exon.bed to hg38.exon.interval_list. If I use hg38.exon.interval_list as the input to CollectReadCounts, it works. However, if I first use PreprocessIntervals to convert hg38.exon.interval_list to targets.preprocessed.interval.list and use that as the input to CollectReadCounts, it fails.
Here is my code:
## Preprocess Intervals
$GATK PreprocessIntervals \
-L ~/wes_cancer/data/hg38.exon.interval_list \
--sequence-dictionary ${dict} \
--reference ${ref} \
--bin-length 0 \
--padding 250 \
--interval-merging-rule OVERLAPPING_ONLY \
--output ~/wes_cancer/data/targets.preprocessed.interval.list
interval=~/wes_cancer/data/targets.preprocessed.interval.list
GATK=~/wes_cancer/biosoft/gatk-4.1.4.1/gatk
ref=~/wes_cancer/data/Homo_sapiens_assembly38.fasta
cat config | while read id
do
i=./5.gatk/${id}_bqsr.bam
echo ${i}
## step1 : CollectReadCounts
time $GATK --java-options "-Xmx7G -Djava.io.tmpdir=./" CollectReadCounts \
-I ${i} \
-L ${interval} \
-R ${ref} \
--format HDF5 \
--interval-merging-rule OVERLAPPING_ONLY \
--output ./8.cnv/gatk/counts/${id}.clean_counts.hdf5
done
Thank you!
-
Hi Tang Huatao,
I believe that the problem here is your file extension, ".interval.list". There are two kinds of interval list file formats supported by GATK:
- A simple list of intervals, one per line. This kind of file can have a .intervals or .list extension.
- A "Picard-style" interval list with a header, starting with a line like "@HD VN:1.6 SO:coordinate". This kind of file has a ".interval_list" extension.
You have the second kind of interval list file (the kind with a header), but your ".list" extension is causing GATK to treat it as the first kind of file, and so it throws an error when it sees the header. Renaming "targets.preprocessed.interval.list" to "targets.preprocessed.interval_list" should solve the problem.
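As a rough illustration of the rename, here is a minimal sketch. Both function names are hypothetical helpers, not GATK commands, and the check assumes that a first line starting with "@" means the file has a Picard-style header.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of GATK).

# Suggest an extension from the first line of the file.
# Assumption: a first line starting with "@" means a Picard-style header.
suggest_interval_extension() {
    local file="$1" first_line=""
    IFS= read -r first_line < "$file" || true
    case "$first_line" in
        @*) echo "interval_list" ;;   # headered, Picard-style
        *)  echo "list" ;;            # plain list, one interval per line
    esac
}

# Rename a headered file so that it ends in ".interval_list",
# then print the resulting path. Files without a header are left alone.
fix_interval_extension() {
    local file="$1" base
    if [ "$(suggest_interval_extension "$file")" != "interval_list" ]; then
        echo "$file"
        return 0
    fi
    case "$file" in
        *.interval_list) echo "$file"; return 0 ;;        # already fine
        *.interval.list) base="${file%.interval.list}" ;;
        *.list)          base="${file%.list}" ;;
        *)               base="$file" ;;
    esac
    mv -- "$file" "${base}.interval_list"
    echo "${base}.interval_list"
}
```

For example, interval="$(fix_interval_extension ~/wes_cancer/data/targets.preprocessed.interval.list)" would rename the file and store the new path for the later -L argument.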
Regards,
David
-
Hi David,
It works, thank you very much!
Regards
Tang Huatao