Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

HaplotypeCallerSpark error

9 comments

  • Gökalp Çelik

    Hi Szu-Ping, Chen

    HaplotypeCallerSpark is still in BETA and is therefore unsupported. The recommended way to accelerate the calling step is to scatter the work over intervals split at the N-masked regions of the reference genome, then collect each interval's calls at the end with the GatherVcfs tool.

    On the other hand, this issue could be due to the nature of the reads and their mates in the BAM file. Can you run the gatk ValidateSamFile tool to check whether there are any errors due to mate CIGARs?

     

  • Szu-Ping, Chen

    Hi Gökalp Çelik,

    Thank you for replying to the issue.
    I ran the ValidateSamFile tool, and it reported errors about incorrect NM tags.

    Here is the log file:

    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.4.0.0-local.jar ValidateSamFile -I SRR11471560_mem2_sort_markdup.bam -M VERBOSE -R GCF_003254395.2_Amel_HAv3.1_genomic.fna --IGNORE_WARNINGS true
    01:06:37.350 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Fri Aug 18 01:06:37 GMT 2023] ValidateSamFile --INPUT SRR11471560_mem2_sort_markdup.bam --MODE VERBOSE --IGNORE_WARNINGS true --REFERENCE_SEQUENCE GCF_003254395.2_Amel_HAv3.1_genomic.fna --MAX_OUTPUT 100 --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    [Fri Aug 18 01:06:37 GMT 2023] Executing as szu-ping.chen@Atlas-0124.HPC.MsState.Edu on Linux 3.10.0-1127.8.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.6+10-Ubuntu-0ubuntu118.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.4.0.0
    ERROR::INVALID_TAG_NM:Record 4896344, Read name SRR11471560.2659277, NM tag (nucleotide differences) in file [0] does not match reality [20]
    ERROR::INVALID_TAG_NM:Record 4896345, Read name SRR11471560.2659277, NM tag (nucleotide differences) in file [0] does not match reality [20]
    ERROR::INVALID_TAG_NM:Record 6777876, Read name SRR11471560.3409914, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 6777877, Read name SRR11471560.4602158, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 6777878, Read name SRR11471560.5446595, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 6777879, Read name SRR11471560.901974, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 6777880, Read name SRR11471560.2129553, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 6777881, Read name SRR11471560.2138022, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 6777882, Read name SRR11471560.3409914, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 6777883, Read name SRR11471560.4352437, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 6777884, Read name SRR11471560.6035416, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 7677935, Read name SRR11471560.229910, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 7677936, Read name SRR11471560.2388598, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 7677937, Read name SRR11471560.2449344, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 7677938, Read name SRR11471560.4829057, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 7677939, Read name SRR11471560.229910, NM tag (nucleotide differences) in file [4] does not match reality [5]
    ERROR::INVALID_TAG_NM:Record 7677940, Read name SRR11471560.2146183, NM tag (nucleotide differences) in file [3] does not match reality [4]
    ERROR::INVALID_TAG_NM:Record 7677941, Read name SRR11471560.2225864, NM tag (nucleotide differences) in file [7] does not match reality [8]
    ERROR::INVALID_TAG_NM:Record 7677942, Read name SRR11471560.4603675, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 7677943, Read name SRR11471560.7388997, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 8883133, Read name SRR11471560.5078529, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883134, Read name SRR11471560.1266605, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883135, Read name SRR11471560.7109831, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883136, Read name SRR11471560.854710, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883138, Read name SRR11471560.6572018, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883139, Read name SRR11471560.154580, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883140, Read name SRR11471560.1266605, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883141, Read name SRR11471560.3484785, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883142, Read name SRR11471560.6555623, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 8883143, Read name SRR11471560.6556712, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 8883144, Read name SRR11471560.6568098, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 8883145, Read name SRR11471560.7011866, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 8883146, Read name SRR11471560.7382108, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 8883147, Read name SRR11471560.3858984, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883148, Read name SRR11471560.6132294, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883149, Read name SRR11471560.689550, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883150, Read name SRR11471560.1813640, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 8883151, Read name SRR11471560.7109831, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883154, Read name SRR11471560.3834858, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883155, Read name SRR11471560.3834858, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 8883156, Read name SRR11471560.4980834, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9761527, Read name SRR11471560.2154582, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925436, Read name SRR11471560.219327, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925437, Read name SRR11471560.264396, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925438, Read name SRR11471560.915593, NM tag (nucleotide differences) in file [1] does not match reality [2]
    ERROR::INVALID_TAG_NM:Record 9925439, Read name SRR11471560.1036577, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925440, Read name SRR11471560.1257820, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925441, Read name SRR11471560.1646678, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925442, Read name SRR11471560.1751811, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925443, Read name SRR11471560.3035529, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925444, Read name SRR11471560.3188778, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925445, Read name SRR11471560.5442335, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925446, Read name SRR11471560.6944111, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925447, Read name SRR11471560.1257820, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 9925448, Read name SRR11471560.4694967, NM tag (nucleotide differences) in file [1] does not match reality [2]
    INFO    2023-08-18 01:07:23     SamFileValidator        Validated Read    10,000,000 records.  Elapsed time: 00:00:45s.  Time for last 10,000,000:   45s.  Last read position: NC_037648.1:9,824,885
    ERROR::INVALID_TAG_NM:Record 11365750, Read name SRR11471560.4557892, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 11365752, Read name SRR11471560.3261983, NM tag (nucleotide differences) in file [0] does not match reality [1]
    ERROR::INVALID_TAG_NM:Record 11389607, Read name SRR11471560.4072500, NM tag (nucleotide differences) in file [3] does not match reality [6]
    ERROR::INVALID_TAG_NM:Record 11389608, Read name SRR11471560.7018680, NM tag (nucleotide differences) in file [1] does not match reality [4]
    ERROR::INVALID_TAG_NM:Record 11389610, Read name SRR11471560.1176829, NM tag (nucleotide differences) in file [11] does not match reality [14]
    ERROR::INVALID_TAG_NM:Record 11389612, Read name SRR11471560.1610352, NM tag (nucleotide differences) in file [11] does not match reality [14]
    ERROR::INVALID_TAG_NM:Record 11389614, Read name SRR11471560.3880086, NM tag (nucleotide differences) in file [3] does not match reality [6]
    ERROR::INVALID_TAG_NM:Record 11389615, Read name SRR11471560.4514460, NM tag (nucleotide differences) in file [11] does not match reality [14]
    ERROR::INVALID_TAG_NM:Record 11389617, Read name SRR11471560.837062, NM tag (nucleotide differences) in file [3] does not match reality [5]
    ERROR::INVALID_TAG_NM:Record 11463081, Read name SRR11471560.3857956, NM tag (nucleotide differences) in file [5] does not match reality [7]
    ERROR::INVALID_TAG_NM:Record 11463082, Read name SRR11471560.304836, NM tag (nucleotide differences) in file [3] does not match reality [5]
    ERROR::INVALID_TAG_NM:Record 11463084, Read name SRR11471560.304732, NM tag (nucleotide differences) in file [1] does not match reality [2]
    [Fri Aug 18 01:07:52 GMT 2023] picard.sam.ValidateSamFile done. Elapsed time: 1.26 minutes.
    Runtime.totalMemory()=2076049408
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Tool returned:
    3

    I used the samtools calmd tool to fix these errors. After processing with calmd, ValidateSamFile reported no errors.

    The calmd command:

    samtools calmd -bAr SRR11471560_mem2_sort_markdup.bam GCF_003254395.2_Amel_HAv3.1_genomic.fna > SRR11471560_sorted_calmd.bam

    Here is the ValidateSamFile log for the calmd output:

    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.4.0.0-local.jar ValidateSamFile -I SRR11471560_sorted_calmd.bam -M VERBOSE -R GCF_003254395.2_Amel_HAv3.1_genomic.fna --IGNORE_WARNINGS true
    01:12:09.818 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Fri Aug 18 01:12:09 GMT 2023] ValidateSamFile --INPUT SRR11471560_sorted_calmd.bam --MODE VERBOSE --IGNORE_WARNINGS true --REFERENCE_SEQUENCE GCF_003254395.2_Amel_HAv3.1_genomic.fna --MAX_OUTPUT 100 --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    [Fri Aug 18 01:12:09 GMT 2023] Executing as szu-ping.chen@Atlas-0124.HPC.MsState.Edu on Linux 3.10.0-1127.8.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.6+10-Ubuntu-0ubuntu118.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.4.0.0
    INFO    2023-08-18 01:13:01     SamFileValidator        Validated Read    10,000,000 records.  Elapsed time: 00:00:51s.  Time for last 10,000,000:   51s.  Last read position: NC_037648.1:9,824,885
    No errors found
    [Fri Aug 18 01:13:31 GMT 2023] picard.sam.ValidateSamFile done. Elapsed time: 1.36 minutes.
    Runtime.totalMemory()=2076049408
    Tool returned:
    0

    However, when I ran HaplotypeCallerSpark on the error-free file, it still failed with the same error.

    Did I take the wrong approach?
    Thank you so much.

  • Gökalp Çelik

    Hi Szu-Ping, Chen

    Can you also try calling the locus where you get this error with the regular (non-Spark) HaplotypeCaller to see whether the problem persists?

    NC_037640.1:2168

    You may try about 300 to 500 bases upstream and downstream of this position to see whether the issue lies with HaplotypeCaller itself or with the Spark version.
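
    One way to set up this check is sketched below. The helper name locus_window, the 500-base flank, and the output name window_test.vcf.gz are illustrative assumptions; the reference and BAM names are the ones used earlier in this thread.

```shell
# Build a "contig:start-end" interval around a position.
# locus_window is a hypothetical helper, not part of GATK.
locus_window() {
    contig=$1; pos=$2; flank=$3
    start=$((pos - flank))
    # Intervals are 1-based, so clamp the start at 1.
    # The end is not clamped to the contig length; adjust it near contig ends.
    if [ "$start" -lt 1 ]; then start=1; fi
    printf '%s:%d-%d\n' "$contig" "$start" "$((pos + flank))"
}

interval=$(locus_window NC_037640.1 2168 500)

# Print the regular (non-Spark) HaplotypeCaller command for that window.
# Drop the leading "echo" to actually run it.
echo gatk HaplotypeCaller \
    -R GCF_003254395.2_Amel_HAv3.1_genomic.fna \
    -I SRR11471560_mem2_sort_markdup.bam \
    -L "$interval" \
    -O window_test.vcf.gz
```

    If the non-Spark run succeeds on this window, that points toward the Spark code path rather than the reads at this locus.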

    Also, could you tell us your mapping steps and the software you used to map these reads?

    And finally, can you share the full log of your HaplotypeCallerSpark run?

     

  • Gökalp Çelik

    Hi again. 

    One additional suggestion came from James Emery: use a simple BED file covering the regions of interest in this BAM file and pass it to the HaplotypeCaller engine with the -L parameter. This may bypass the reference index problem so that HaplotypeCallerSpark can finish properly.
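
    As a rough sketch of that suggestion, a BED file covering every contig can be derived from the reference's .fai index (as written by samtools faidx). The helper name fai_to_bed and the file name regions.bed are hypothetical; restricting the output to fewer contigs is left open.

```shell
# Convert a FASTA index (.fai: name, length, ...) into whole-contig BED records.
# BED is 0-based and half-open, so a contig of length N becomes 0..N.
# fai_to_bed is a hypothetical helper, not part of GATK or samtools.
fai_to_bed() {
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, 0, $2 }' "$1"
}

# Example use (assumes the .fai sits next to the reference):
#   fai_to_bed GCF_003254395.2_Amel_HAv3.1_genomic.fna.fai > regions.bed
#   gatk HaplotypeCallerSpark -R GCF_003254395.2_Amel_HAv3.1_genomic.fna \
#       -I SRR11471560_mem2_sort_markdup.bam -L regions.bed \
#       -O SRR11471560_mem2_spark_gatk.vcf.gz
```

    Because the records follow the order of the .fai, the intervals stay in the same contig order as the reference.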

    Let us know if this works. 

    Regards. 

  • Szu-Ping, Chen

    Hi Gökalp Çelik,

    Thank you for your reply.
    I tried HaplotypeCaller on the same file. The whole file ran successfully, with no need for a BED file to get past the problematic region.

    The mapping steps and tools I used to process the files are:
    Raw SRA data -> trim reads with Trimmomatic -> map reads with BWA-MEM2 -> sort the SAM file with samtools and convert it to BAM (also adding the read group information) -> mark duplicates with MarkDuplicates in GATK4 (Picard) -> call variants

    When mapping the reads, I also tried the older BWA-MEM, which gave the same error. For the sorting step, I also tried the SortSam tool in GATK4, but HaplotypeCallerSpark still failed with the same error.

    Link to the full HaplotypeCallerSpark log:

    https://drive.google.com/file/d/1GTxLGMoRzbb9emmF51MXt_B1nUMSutJN/view?usp=sharing

    Thank you so much!

    Regards. 
     
     
     
  • Szu-Ping, Chen

    Hi again,

    I tried passing an interval list to HaplotypeCallerSpark, and it hit a similar error at a different location.

    The relevant part of the log file:

    Caused by: java.lang.IllegalArgumentException: Sequence [VC HC_call @ NW_020555792.1:138 Q78.32 of type=SNP alleles=[T*, G] attr={AC=2, AF=1.0, AN=2, DP=2, ExcessHet=0.0000, FS=0.000, MLEAC=[1], MLEAF=[0.5], MQ=47.00, QD=29.09, SOR=0.693} GT=[[AR0111 G/G GQ 6 DP 2 AD 0,2 PL 90,6,0]] filters= added out of order currentReferenceIndex: 9, referenceIndex:11
            at htsjdk.tribble.index.tabix.AllRefsTabixIndexCreator.addFeature(AllRefsTabixIndexCreator.java:79)
            at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
            at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:242)
            at org.disq_bio.disq.impl.formats.vcf.HeaderlessVcfOutputFormat$VcfRecordWriter.write(HeaderlessVcfOutputFormat.java:93)
            at org.disq_bio.disq.impl.formats.vcf.HeaderlessVcfOutputFormat$VcfRecordWriter.write(HeaderlessVcfOutputFormat.java:56)
            at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:368)
            at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:138)
            at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
            at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:135)
            ... 9 more
    01:50:24.007 INFO  ShutdownHookManager - Shutdown hook called
    01:50:24.007 INFO  ShutdownHookManager - Deleting directory /tmp/spark-89545ed5-67df-4063-886b-cb36ac1c0aeb
    Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.4.0.0-local.jar HaplotypeCallerSpark -R GCF_003254395.2_Amel_HAv3.1_genomic.fna -I SRR11471560_mem2_sort_markdup.bam -O SRR11471560_mem2_spark_gatk.vcf.gz -L GCF_003254395.2_Amel_HAv3.1_genomic.interval_list

    It's odd that the problem does not occur with the regular HaplotypeCaller.
    Is there an alternative approach I could try?

    Thank you.

    Regards. 

  • Gökalp Çelik

    Hi Szu-Ping, Chen

    This looks like a bug in the Spark implementation, so it needs further action on the developer end. For now, you can use the regular HaplotypeCaller to call your variants. If you wish to accelerate the process, you can scatter your genome into multiple intervals and feed each interval to a separate HaplotypeCaller instance to call variants in parallel. The final VCFs can then be combined with the GatherVcfs tool to generate the whole file.
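
    A minimal sketch of that scatter-gather pattern follows, scattering by whole contig for simplicity. The helper name emit_scatter_gather and the shard file names are hypothetical, and the commands are printed rather than executed so they can be handed to whatever job scheduler is in use.

```shell
# Print one HaplotypeCaller command per contig listed in a .fai index,
# followed by a GatherVcfs command that joins the shards in the same order.
# emit_scatter_gather is a hypothetical helper; shard names are illustrative.
emit_scatter_gather() {
    fai=$1; ref=$2; bam=$3; out=$4
    gather="gatk GatherVcfs"
    # The first tab-separated column of a .fai line is the contig name.
    while IFS="$(printf '\t')" read -r contig _rest; do
        shard="shard_${contig}.vcf.gz"
        echo "gatk HaplotypeCaller -R $ref -I $bam -L $contig -O $shard"
        # Shards are appended in .fai order, i.e. reference order.
        gather="$gather -I $shard"
    done < "$fai"
    echo "$gather -O $out"
}

# Example use:
#   emit_scatter_gather GCF_003254395.2_Amel_HAv3.1_genomic.fna.fai \
#       GCF_003254395.2_Amel_HAv3.1_genomic.fna \
#       SRR11471560_mem2_sort_markdup.bam SRR11471560_gathered.vcf.gz
```

    The printed HaplotypeCaller lines are independent of one another and can run as separate jobs; the GatherVcfs line should run only after all of them finish, with the shards kept in reference order.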

    I hope this helps. 

  • Szu-Ping, Chen

    Hi Gökalp Çelik,

    We work on insect genome data. However, most of the sequencing data are not of high integrity. Is the interval-scattering method still suitable?

    Regards. 

  • Gökalp Çelik

    Hi Szu-Ping, Chen

    Depending on how you scatter your intervals, it should still work. In the worst case you may have to run HaplotypeCaller per contig/chromosome, which is probably the safest way. If your reference is split by long runs of N, however, you may want to split your intervals at the positions of those N runs.
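
    To illustrate splitting at N runs, the sketch below scans a FASTA file and prints the blocks of sequence that lie between runs of N. The helper name acgt_blocks and the default threshold of 100 consecutive Ns are assumptions made for this example. GATK also bundles the Picard tool ScatterIntervalsByNs for this purpose, which would likely be the better choice on a real genome since this character-by-character scan is slow.

```shell
# Print "contig:start-end" (1-based, inclusive) for each block of sequence
# separated by a run of at least min_n Ns. Shorter N runs stay inside a block.
# acgt_blocks is a hypothetical helper; usage: acgt_blocks ref.fasta [min_n]
acgt_blocks() {
    awk -v min_n="${2:-100}" '
        function emit() {
            if (start > 0) printf "%s:%d-%d\n", name, start, last
            start = 0
        }
        /^>/ {                       # new contig: close the open block
            emit()
            name = substr($1, 2); pos = 0; nrun = 0
            next
        }
        {
            n = length($0)
            for (i = 1; i <= n; i++) {
                pos++
                if (toupper(substr($0, i, 1)) == "N") { nrun++; continue }
                if (start == 0) start = pos          # first base of a block
                else if (nrun >= min_n) {            # long N run: split here
                    printf "%s:%d-%d\n", name, start, last
                    start = pos
                }
                nrun = 0
                last = pos
            }
        }
        END { emit() }
    ' "$1"
}
```

    The output can be saved to a file with a .intervals extension and passed to HaplotypeCaller with -L, one block (or one group of blocks) per job.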

    I hope this helps. 
