HaplotypeCallerSpark error
Hello, GATK team,
I'm using HaplotypeCallerSpark to accelerate the variant calling step. However, I keep encountering the same problem. I tried both GATK 4.4.0.0 and GATK 4.1.8.0, and the results were the same. How can I solve this problem?
The command I used:
singularity run broadinstitute_gatk:latest.sif gatk HaplotypeCallerSpark -R GCF_003254395.2_Amel_HAv3.1_genomic.fna -I SRR11471560_mem2_sort_markdup.bam -O SRR11471560_mem2_spark_markdup.vcf.gz
Part of the log file:
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:358)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:132)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:141)
... 10 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 31 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:157)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Sequence [VC HC_call @ NC_037640.1:2168 Q191.64 of type=SNP alleles=[C*, G] attr={AC=1, AF=0.5, AN=2, BaseQRankSum=0.636, DP=175, ExcessHet=3.0103, FS=4.591, MLEAC=[1], MLEAF=[0.5], MQ=33.03, MQRankSum=1.584, QD=1.10, ReadPosRankSum=-1.573, SOR=0.250} GT=[[AR0111 C*/G GQ 199 DP 174 AD 154,20 PL 199,0,4727]] filters= added out of order currentReferenceIndex: 1, referenceIndex:3
at htsjdk.tribble.index.tabix.AllRefsTabixIndexCreator.addFeature(AllRefsTabixIndexCreator.java:79)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:242)
at org.disq_bio.disq.impl.formats.vcf.HeaderlessVcfOutputFormat$VcfRecordWriter.write(HeaderlessVcfOutputFormat.java:93)
at org.disq_bio.disq.impl.formats.vcf.HeaderlessVcfOutputFormat$VcfRecordWriter.write(HeaderlessVcfOutputFormat.java:56)
The main error, I think, is java.lang.IllegalArgumentException: Sequence [VC HC_call @ NC_037640.1:2168 Q191.64 of type=SNP alleles=[C*, G] attr={AC=1, AF=0.5, AN=2, BaseQRankSum=0.636, DP=175, ExcessHet=3.0103, FS=4.591, MLEAC=[1], MLEAF=[0.5], MQ=33.03, MQRankSum=1.584, QD=1.10, ReadPosRankSum=-1.573, SOR=0.250} GT=[[AR0111 C*/G GQ 199 DP 174 AD 154,20 PL 199,0,4727]] filters= added out of order currentReferenceIndex: 1, referenceIndex:3
Thank you for your help!
-
HaplotypeCallerSpark is still in BETA and is therefore unsupported. The recommended way to accelerate the calling step is to scatter the genome into intervals based on the N-masked regions of the reference, call each interval separately, and collect the per-interval calls at the end with the GatherVcfs tool.
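As a rough sketch of the first half of that recommendation, the N-free stretches of the reference can be written to an interval list with ScatterIntervalsByNs. The helper name `acgt_intervals` and the output file name are placeholders, and the sketch assumes the reference already has a .fai index and a .dict sequence dictionary next to it.

```shell
#!/usr/bin/env bash
# Sketch: write the N-free (ACGT) stretches of a reference to an interval list.
# acgt_intervals is a hypothetical helper name, not part of GATK.
# Assumption: the reference has a .fai index and a .dict dictionary alongside it.
acgt_intervals() {
    local ref="$1" out="$2"
    gatk ScatterIntervalsByNs -R "$ref" --OUTPUT_TYPE ACGT -O "$out"
}
```

For the reference in this thread, it could be called as `acgt_intervals GCF_003254395.2_Amel_HAv3.1_genomic.fna Amel_HAv3.1.acgt.interval_list`, and the resulting interval list can then be split into shards for separate HaplotypeCaller runs.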
On the other hand, this issue could be due to the nature of the reads and their mates in the BAM file. Can you run the gatk ValidateSamFile tool to check whether there are any errors related to mate CIGARs?
-
Hi Gökalp Çelik,
Thank you for replying to the issue.
I ran the ValidateSamFile tool, and it reported errors about incorrect NM tags. Here is the log file:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.4.0.0-local.jar ValidateSamFile -I SRR11471560_mem2_sort_markdup.bam -M VERBOSE -R GCF_003254395.2_Amel_HAv3.1_genomic.fna --IGNORE_WARNINGS true
01:06:37.350 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Aug 18 01:06:37 GMT 2023] ValidateSamFile --INPUT SRR11471560_mem2_sort_markdup.bam --MODE VERBOSE --IGNORE_WARNINGS true --REFERENCE_SEQUENCE GCF_003254395.2_Amel_HAv3.1_genomic.fna --MAX_OUTPUT 100 --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri Aug 18 01:06:37 GMT 2023] Executing as szu-ping.chen@Atlas-0124.HPC.MsState.Edu on Linux 3.10.0-1127.8.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.6+10-Ubuntu-0ubuntu118.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.4.0.0
ERROR::INVALID_TAG_NM:Record 4896344, Read name SRR11471560.2659277, NM tag (nucleotide differences) in file [0] does not match reality [20]
ERROR::INVALID_TAG_NM:Record 4896345, Read name SRR11471560.2659277, NM tag (nucleotide differences) in file [0] does not match reality [20]
ERROR::INVALID_TAG_NM:Record 6777876, Read name SRR11471560.3409914, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 6777877, Read name SRR11471560.4602158, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 6777878, Read name SRR11471560.5446595, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 6777879, Read name SRR11471560.901974, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 6777880, Read name SRR11471560.2129553, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 6777881, Read name SRR11471560.2138022, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 6777882, Read name SRR11471560.3409914, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 6777883, Read name SRR11471560.4352437, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 6777884, Read name SRR11471560.6035416, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 7677935, Read name SRR11471560.229910, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 7677936, Read name SRR11471560.2388598, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 7677937, Read name SRR11471560.2449344, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 7677938, Read name SRR11471560.4829057, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 7677939, Read name SRR11471560.229910, NM tag (nucleotide differences) in file [4] does not match reality [5]
ERROR::INVALID_TAG_NM:Record 7677940, Read name SRR11471560.2146183, NM tag (nucleotide differences) in file [3] does not match reality [4]
ERROR::INVALID_TAG_NM:Record 7677941, Read name SRR11471560.2225864, NM tag (nucleotide differences) in file [7] does not match reality [8]
ERROR::INVALID_TAG_NM:Record 7677942, Read name SRR11471560.4603675, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 7677943, Read name SRR11471560.7388997, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 8883133, Read name SRR11471560.5078529, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883134, Read name SRR11471560.1266605, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883135, Read name SRR11471560.7109831, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883136, Read name SRR11471560.854710, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883138, Read name SRR11471560.6572018, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883139, Read name SRR11471560.154580, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883140, Read name SRR11471560.1266605, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883141, Read name SRR11471560.3484785, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883142, Read name SRR11471560.6555623, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 8883143, Read name SRR11471560.6556712, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 8883144, Read name SRR11471560.6568098, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 8883145, Read name SRR11471560.7011866, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 8883146, Read name SRR11471560.7382108, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 8883147, Read name SRR11471560.3858984, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883148, Read name SRR11471560.6132294, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883149, Read name SRR11471560.689550, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883150, Read name SRR11471560.1813640, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 8883151, Read name SRR11471560.7109831, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883154, Read name SRR11471560.3834858, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883155, Read name SRR11471560.3834858, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 8883156, Read name SRR11471560.4980834, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9761527, Read name SRR11471560.2154582, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925436, Read name SRR11471560.219327, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925437, Read name SRR11471560.264396, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925438, Read name SRR11471560.915593, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR::INVALID_TAG_NM:Record 9925439, Read name SRR11471560.1036577, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925440, Read name SRR11471560.1257820, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925441, Read name SRR11471560.1646678, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925442, Read name SRR11471560.1751811, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925443, Read name SRR11471560.3035529, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925444, Read name SRR11471560.3188778, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925445, Read name SRR11471560.5442335, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925446, Read name SRR11471560.6944111, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925447, Read name SRR11471560.1257820, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 9925448, Read name SRR11471560.4694967, NM tag (nucleotide differences) in file [1] does not match reality [2]
INFO 2023-08-18 01:07:23 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:45s. Time for last 10,000,000: 45s. Last read position: NC_037648.1:9,824,885
ERROR::INVALID_TAG_NM:Record 11365750, Read name SRR11471560.4557892, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 11365752, Read name SRR11471560.3261983, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR::INVALID_TAG_NM:Record 11389607, Read name SRR11471560.4072500, NM tag (nucleotide differences) in file [3] does not match reality [6]
ERROR::INVALID_TAG_NM:Record 11389608, Read name SRR11471560.7018680, NM tag (nucleotide differences) in file [1] does not match reality [4]
ERROR::INVALID_TAG_NM:Record 11389610, Read name SRR11471560.1176829, NM tag (nucleotide differences) in file [11] does not match reality [14]
ERROR::INVALID_TAG_NM:Record 11389612, Read name SRR11471560.1610352, NM tag (nucleotide differences) in file [11] does not match reality [14]
ERROR::INVALID_TAG_NM:Record 11389614, Read name SRR11471560.3880086, NM tag (nucleotide differences) in file [3] does not match reality [6]
ERROR::INVALID_TAG_NM:Record 11389615, Read name SRR11471560.4514460, NM tag (nucleotide differences) in file [11] does not match reality [14]
ERROR::INVALID_TAG_NM:Record 11389617, Read name SRR11471560.837062, NM tag (nucleotide differences) in file [3] does not match reality [5]
ERROR::INVALID_TAG_NM:Record 11463081, Read name SRR11471560.3857956, NM tag (nucleotide differences) in file [5] does not match reality [7]
ERROR::INVALID_TAG_NM:Record 11463082, Read name SRR11471560.304836, NM tag (nucleotide differences) in file [3] does not match reality [5]
ERROR::INVALID_TAG_NM:Record 11463084, Read name SRR11471560.304732, NM tag (nucleotide differences) in file [1] does not match reality [2]
[Fri Aug 18 01:07:52 GMT 2023] picard.sam.ValidateSamFile done. Elapsed time: 1.26 minutes.
Runtime.totalMemory()=2076049408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned:
3
I tried to fix the errors with the samtools calmd tool. After processing with calmd, ValidateSamFile reported no errors.
The calmd command:
samtools calmd -bAr SRR11471560_mem2_sort_markdup.bam GCF_003254395.2_Amel_HAv3.1_genomic.fna > SRR11471560_sorted_calmd.bam
Here is the ValidateSamFile log file:
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.4.0.0-local.jar ValidateSamFile -I SRR11471560_sorted_calmd.bam -M VERBOSE -R GCF_003254395.2_Amel_HAv3.1_genomic.fna --IGNORE_WARNINGS true
01:12:09.818 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Aug 18 01:12:09 GMT 2023] ValidateSamFile --INPUT SRR11471560_sorted_calmd.bam --MODE VERBOSE --IGNORE_WARNINGS true --REFERENCE_SEQUENCE GCF_003254395.2_Amel_HAv3.1_genomic.fna --MAX_OUTPUT 100 --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri Aug 18 01:12:09 GMT 2023] Executing as szu-ping.chen@Atlas-0124.HPC.MsState.Edu on Linux 3.10.0-1127.8.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.6+10-Ubuntu-0ubuntu118.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.4.0.0
INFO 2023-08-18 01:13:01 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:51s. Time for last 10,000,000: 51s. Last read position: NC_037648.1:9,824,885
No errors found
[Fri Aug 18 01:13:31 GMT 2023] picard.sam.ValidateSamFile done. Elapsed time: 1.36 minutes.
Runtime.totalMemory()=2076049408
Tool returned:
0
However, when I ran HaplotypeCallerSpark on the error-free file, it still failed with the same error.
Did I go the wrong way?
Thank you so much.
-
Can you also try calling the locus where you get this error with just HaplotypeCaller (non-Spark) to see if the problem persists?
NC_037640.1:2168
You may try about 300 to 500 bases upstream and downstream of this position to see whether the issue is with HaplotypeCaller itself or with the Spark version.
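A small sketch of how such a window could be built and passed to the regular HaplotypeCaller follows. The helper name `padded_region` and the output file name are made up for illustration; the input file names are the ones from the original command.

```shell
#!/usr/bin/env bash
# Sketch: build a "contig:start-end" interval padded around a 1-based position.
# padded_region is a hypothetical helper, not a GATK command.
padded_region() {
    local contig="$1" pos="$2" pad="$3"
    local start=$(( pos - pad ))
    local end=$(( pos + pad ))
    if [ "$start" -lt 1 ]; then
        start=1   # clamp at the first base of the contig
    fi
    echo "${contig}:${start}-${end}"
}

# Example call (output name is a placeholder):
#   region=$(padded_region NC_037640.1 2168 500)
#   gatk HaplotypeCaller \
#       -R GCF_003254395.2_Amel_HAv3.1_genomic.fna \
#       -I SRR11471560_mem2_sort_markdup.bam \
#       -L "$region" \
#       -O SRR11471560_region_test.vcf.gz
```

With a padding of 500 the interval for the failing site is NC_037640.1:1668-2668. The end coordinate is not clamped to the contig length here, so a position near the end of a contig would need an extra check.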
Also, could you tell us your mapping steps and the software you used to map these reads?
And finally, can you share the full log of your HaplotypeCallerSpark run?
-
Hi again.
One additional suggestion came from James Emery: use a simple BED file covering the regions of interest in this BAM file and pass it to the HaplotypeCaller engine with the -L parameter. This may help you bypass the reference index problem so that HaplotypeCallerSpark can finish properly.
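One way to get such a BED file, sketched here under the assumption that every contig should be covered in full, is to derive it from the FASTA index created by `samtools faidx`. The helper name `fai_to_bed` and the BED file name are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: turn a FASTA index (.fai) into a BED file with one whole-contig row per sequence.
# Columns 1 and 2 of a .fai are the sequence name and its length; BED starts are 0-based.
# fai_to_bed is a hypothetical helper name.
fai_to_bed() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { print $1, 0, $2 }' "$1"
}
```

For example, `fai_to_bed GCF_003254395.2_Amel_HAv3.1_genomic.fna.fai > all_contigs.bed` writes the regions, and `-L all_contigs.bed` can then be added to the calling command. To restrict calling to particular regions, the BED file can be trimmed to the contigs of interest.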
Let us know if this works.
Regards.
-
Hi Gökalp Çelik,
Thank you for your reply.
I've tried HaplotypeCaller on the same file. The whole file ran successfully, and I didn't need a BED file to skip the problematic region.
The mapping steps and tools that I used to process the files are:
Raw SRA data -> trim reads with Trimmomatic -> map the reads with BWA-MEM2 -> sort the SAM file with samtools and convert it to a BAM file (also adding the read group information) -> MarkDuplicates in GATK4 (Picard) -> call variants
For mapping the reads, I also tried the older BWA-MEM, which gave the same error. For the sorting step, I also tried the SortSam tool in GATK4, but HaplotypeCallerSpark still failed with the same error.
Link to the full HaplotypeCallerSpark log file:
https://drive.google.com/file/d/1GTxLGMoRzbb9emmF51MXt_B1nUMSutJN/view?usp=sharing
Thank you so much!
Regards.
-
Hi again,
I tried adding the interval list to HaplotypeCallerSpark, and it hit a similar issue at a different location.
Part of the log file:
Caused by: java.lang.IllegalArgumentException: Sequence [VC HC_call @ NW_020555792.1:138 Q78.32 of type=SNP alleles=[T*, G] attr={AC=2, AF=1.0, AN=2, DP=2, ExcessHet=0.0000, FS=0.000, MLEAC=[1], MLEAF=[0.5], MQ=47.00, QD=29.09, SOR=0.693} GT=[[AR0111 G/G GQ 6 DP 2 AD 0,2 PL 90,6,0]] filters= added out of order currentReferenceIndex: 9, referenceIndex:11
at htsjdk.tribble.index.tabix.AllRefsTabixIndexCreator.addFeature(AllRefsTabixIndexCreator.java:79)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:242)
at org.disq_bio.disq.impl.formats.vcf.HeaderlessVcfOutputFormat$VcfRecordWriter.write(HeaderlessVcfOutputFormat.java:93)
at org.disq_bio.disq.impl.formats.vcf.HeaderlessVcfOutputFormat$VcfRecordWriter.write(HeaderlessVcfOutputFormat.java:56)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:368)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:138)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:135)
... 9 more
01:50:24.007 INFO ShutdownHookManager - Shutdown hook called
01:50:24.007 INFO ShutdownHookManager - Deleting directory /tmp/spark-89545ed5-67df-4063-886b-cb36ac1c0aeb
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.4.0.0-local.jar HaplotypeCallerSpark -R GCF_003254395.2_Amel_HAv3.1_genomic.fna -I SRR11471560_mem2_sort_markdup.bam -O SRR11471560_mem2_spark_gatk.vcf.gz -L GCF_003254395.2_Amel_HAv3.1_genomic.interval_list
It's weird that the problem does not occur when using HaplotypeCaller.
Is there an alternative approach I could try?
Thank you.
Regards.
-
This looks like a bug in the Spark implementation and therefore needs further action on the developer end. For now, you can use the regular HaplotypeCaller to call your variants. If you wish to accelerate the process, you may scatter your genome into multiple intervals and feed each interval to a separate HaplotypeCaller instance to call your variants in parallel. The final VCFs can then be combined with the GatherVcfs tool to generate the whole file.
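The scatter/gather idea could look roughly like the sketch below. It assumes an interval list (or BED file) of the regions to call is already available and that the machine has enough memory to run all shards at once. The function name, directory names and shard count are placeholders, and a production pipeline would normally submit each shard as its own cluster job and check its exit status.

```shell
#!/usr/bin/env bash
# Sketch of scatter/gather with the regular HaplotypeCaller.
# scatter_gather, "shards" and "calls" are hypothetical names chosen for this example.
scatter_gather() {
    local ref="$1" bam="$2" intervals="$3" n_shards="$4" out="$5"
    local shard_dir="shards" call_dir="calls"
    mkdir -p "$shard_dir" "$call_dir"

    # 1. Split the regions into shards (0000-scattered.interval_list, 0001-..., ...)
    #    without cutting any single interval in the middle.
    gatk SplitIntervals -R "$ref" -L "$intervals" \
        --scatter-count "$n_shards" \
        --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
        -O "$shard_dir"

    # 2. Start one HaplotypeCaller process per shard in the background.
    local list name
    for list in "$shard_dir"/*.interval_list; do
        name=$(basename "$list" .interval_list)
        gatk HaplotypeCaller -R "$ref" -I "$bam" -L "$list" \
            -O "${call_dir}/${name}.vcf.gz" &
    done
    wait   # blocks until all shards finish; exit codes are not checked in this sketch

    # 3. Concatenate the shard VCFs. The glob sorts the zero-padded names,
    #    which keeps the inputs in genomic order as GatherVcfs requires.
    local inputs=() vcf
    for vcf in "$call_dir"/*.vcf.gz; do
        inputs+=(-I "$vcf")
    done
    gatk GatherVcfs "${inputs[@]}" -O "$out"
}
```

A call might look like `scatter_gather GCF_003254395.2_Amel_HAv3.1_genomic.fna SRR11471560_mem2_sort_markdup.bam GCF_003254395.2_Amel_HAv3.1_genomic.interval_list 8 SRR11471560_gathered.vcf.gz`, where the shard count of 8 and the output name are arbitrary.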
I hope this helps.
-
Hi Gökalp Çelik,
We work on insect genome data. However, most of the sequencing data don't have high integrity. Is the interval scattering method suitable in this case?
Regards.
-
Depending on how you scatter your intervals, it should still hold true. In the worst case, you may have to run HaplotypeCaller per contig/chromosome, which is probably the safest way. But if your reference is split by long runs of Ns, you may want to split your intervals based on the positions of those N runs.
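The per-contig fallback could be sketched as below. It reads contig names from the FASTA index (`<reference>.fai`, as written by `samtools faidx`) and runs the contigs one after another; the function name and output layout are placeholders, and in practice each contig would more likely be submitted as a separate job.

```shell
#!/usr/bin/env bash
# Sketch: one regular HaplotypeCaller run per contig, then gather in reference order.
# call_per_contig is a hypothetical helper; it assumes "${ref}.fai" exists.
call_per_contig() {
    local ref="$1" bam="$2" outdir="$3"
    local contig inputs=()
    mkdir -p "$outdir"

    # The first column of the .fai is the contig name, listed in reference order.
    while IFS=$'\t' read -r contig _; do
        # </dev/null keeps the tool from consuming the loop's input
        gatk HaplotypeCaller -R "$ref" -I "$bam" -L "$contig" \
            -O "${outdir}/${contig}.vcf.gz" </dev/null
        inputs+=(-I "${outdir}/${contig}.vcf.gz")
    done < "${ref}.fai"

    # Inputs are passed in the same order as the reference contigs.
    gatk GatherVcfs "${inputs[@]}" -O "${outdir}/all_contigs.vcf.gz"
}
```

Because every run covers a whole contig, no interval boundary falls inside a contig, which is why this is the most conservative way to split the work.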
I hope this helps.