Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

HaplotypeCaller error: Read is malformed: read starts with deletion

1

21 comments

  • Avatar
    谌于蓝

    I got the same error message, and I cannot find any read starts with a deletion in my input bam file. it's very strange. hope gatk team response soon@Derek Caetano-Anolles 

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    I'm getting a similar error message; so I'm interested in possible solutions to the problem.  (It seems like this should be a warning level message rather than a hard error message.)  Also, since a VCF file is actually created, I'm wondering if it would be complete or truncated due to early termination of HaplotypeCaller; because of the error.  I get the error in GATK v.4.1.5.0.  The above post suggests that just backing up to v.4.1.4.1 won't suppress the error.

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    HI Sue John WVNicholson 

     

    This might be a bug and we are investigating it. You can follow the progress for this issue here: https://github.com/broadinstitute/gatk/issues/6490

    To help us troubleshoot and fix this bug can you please share with us following:

    1. From the BAM file please share the problematic read with read name: Read A00262:316:HLW5MDSXX:3:1346:7844:3646 CHR01:1602331-1602484
    2. Please share with us the the input BAM as well

    You can find instructions to share data here: https://gatk.zendesk.com/hc/en-us/articles/360035889671

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    There's a medical confidentiality issue with the BAM file; so I will have to send an "excerpted" version of it, after confirming that I can reproduce the problem with that - depending on the bug and HaplotypeCaller works that may not be realistic.  (I'm wondering about that, having seen the bug report on GitHub.) Another thing is that the BAM file that goes into HaplotypeCaller was created by ApplyBQSR.  According to Picard's ValidateSamFile it has a lot of problems relating to mate pairs that weren't in the original BAM file (prior to doing BQSR).  I haven't tried running HaplotypeCaller on the original BAM file yet.

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    Okay I have tried running HaplotypeCaller on the original BAM file now.  Earlier I posted prematurely that it looked like it does not get the above error; but the job hadn't finished which obviously meant the error hadn't output yet and HaplotypeCaller really does get the same error with the original BAM as well.  (But ApplyBQSR seems to have some other problem which introduces a problem with mate pairs, according to Picard's ValidateSamFile which throws up hundreds of errors.)

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    Okay, I've been able to make a smaller BAM file that reproduces the error.  It appears that neither of the reads in the original BAM file with the same QNAME have the same cigar string as in the error message which suggests a bug to me - probably the one discussed on GitHub at the address above.

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    Okay, I just posted the archive file - haplotypecaller_ec0942_bug_report_11mar20.zip - with the bug report.  This has the smaller BAM file that reproduces the error from my last post.  Hopefully this is okay, I realised subsequently it is like the snippet file in your guidelines for making a detailed bug report; but it has less padding than suggested.  But since it reproduces the error, I think it ought to be sufficient.  Also, I've included the reference used.  This is just the human GRCh37 chromosome 1 sequence; but I thought I ought to send it in case there is still some ambiguity about the precise version, masking used, etc.  (I originally used a reference with all human chromosomes represented; but I thought it was a bit big to send.)

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    I believe I neglected to include the BED file used in the command line I provided.  I don't have the file handy right now as I'm at home; but I'm fairly sure it had a single genomic interval 1:153584200-153584400 (since I do have to hand an e-mail where I discussed this with a colleague).  In any case, I think the error will be reproduced without the "-L" argument that includes the BED file.

    By the way, the original BAM file was created by running BWA with the command line "bwa mem -t 8 -k 19 -w 100 -d 100 -r 1.500000 -c 10000 -A 1 -B 1 -O 6 -E 1 -L 5 -U 9 -T 30 -M reference/Homo_sapiens.GRCh37.68.dna.chromosome.all EC0942_1.fq.gz EC0942_2.fq.gz" and then subsequently processed by MarkDuplicates.  Both these can be checked by using "samtools view -H" on the BAM file that I've sent.  And the smaller  BAM file I sent was subsequently processed with "samtools view -L bad_read_region_rounded.bed -b EC0942.bam -o EC0942BadRegionRounded.bam" which is not revealed by "samtools view -H".

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    Sue John WVNicholson A bug fix for this is currently in code review: https://github.com/broadinstitute/gatk/pull/6498

    0
    Comment actions Permalink
  • Avatar
    WVNicholson

    It's looking like using GATK v.4.1.4.1 is a workaround for me despite what I said earlier.  (I guess previously Sue John meant v.4.1.4.0 specifically by v.4.1.4.)

    0
    Comment actions Permalink
  • Avatar
    Sue John

    WVNicholson: I had used v4.1.4.1-77.  Sorry for not mentioning it clearly before.

    0
    Comment actions Permalink
  • Avatar
    Julien Soubrier

    David Benjamin, I see you are currently working on https://github.com/broadinstitute/gatk/pull/6498.

    Would you still need examples of this happening? Happy to try to upload everything as a 'bud report' if it can be helpful (i.e., you'd still need it and I didn't do anything silly to start with...)

     

    BWA 0.7.17-r1188

    samtools 1.10 (using htslib 1.10.2)

    The Genome Analysis Toolkit (GATK) v4.1.5.0
    HTSJDK Version: 2.21.2
    Picard Version: 2.21.9

    (Previous cmds: fastp; BWA MEM; MarkDuplicatesSpark; BQSRPipelineSpark; samtools CRAM)

    time gatk --java-options "-Xms4g -Xmx32g" HaplotypeCaller -R /hg38/gatk_bundle/Homo_sapiens_assembly38.fasta -I MySample_dedupSorted_BQSRSpark.cram -O MySample_raw_variants.g.vcf.gz --bam-output MySample_hap.bam -ERC GVCF --intervals MyExomeRegions.bed

    # 12:20:44.608 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 4.059408435
    # 12:20:44.608 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 331.09542480100004
    # 12:20:44.608 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 1130.65 sec
    # 12:21:00.370 INFO HaplotypeCaller - Shutting down engine
    # [17 March 2020 12:21:00 PM] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 79.23 minutes.
    # Runtime.totalMemory()=16135487488
    # ***********************************************************************
    # A USER ERROR has occurred: Read NB501009:364:HFTJNBGXF:2:23209:22277:13092 chr15:40289938-40290013 is malformed: read starts with deletion. Cigar: 74H4D72M. Although the SAM spec technically permits such reads, this is often indicative of malformed files.
    # ***********************************************************************
    # Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
    # real 79m17.715s


    samtools view -@ 16 MySample_dedupSorted_BQSRSpark.cram -T /hg38/gatk_bundle/Homo_sapiens_assembly38.fasta | grep "NB501009:364:HFTJNBGXF:2:23209:22277:13092"

    # NB501009:364:HFTJNBGXF:2:23209:22277:13092 99 chr15 40289854 60 74M14D72M = 40290038 330 TGCATGGACAGCGCTAGGCAGTGCATCTCCTAAGGGCAAAGTCAGGGCCTGGGGCCACCCCTTCCTACTTGTGAAGAGAGAGAGAGAGAGAGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTTAGGAAGGACACAG @DC@BEDBDBEE=EC@EEEBE@FEBBECECC@CEEEEBCCE@EBEEEECCFEEEECBDCCCCDECC@DCDF@FBCEBEBEBEBEBEBEBEBEBE@F@F@F@F@F@F@F@F@F@F@F@F@E@F@F@F@FA2ADE*;0A.3?CEA../ MC:Z:146M AS:i:106 XS:i:60 MD:Z:74^AGAGAGAGAGAGAG20A1A1A1A45 NM:i:18 RG:Z:MedEx
    # NB501009:364:HFTJNBGXF:2:23209:22277:13092 147 chr15 40290038 60 146M = 40289854 -330 TGCACTACTTCCTGGATGTGGGAGTTGTTAATCTCTCTCTTCAACCTGTTGGTGTAATTGGAGTTTTAGAAAAGGGAGAAAACCCCTAGGGAGAGTAGGGTAACATGAAGTCTACGCAGGATGTAGGCAAATGAGAAATCCAACAT .DFAEAAFD>FECDFCBD=CCECDABBC@D@BEBEBCBECBFD@EE>DCBCDBD@DBCBCECACCC>CEADDCC>ECEDDD@EEEE@CCCECECD@CCCD@D@FBBEDCDBE@@=EFCCEB>D@CCEFDDBBECEDDBBEFDAEAB MC:Z:74M14D72M AS:i:146 XS:i:20 MD:Z:146 NM:i:0 RG:Z:MedEx

    0
    Comment actions Permalink
  • Avatar
    Kelly Rabionet

    Hi, i encountered the same issue (malformed read starts with deletion) and a similar one (malformed read ends with deletion) on two different samples. These are the versions I am using:

    bwa-0.7.17

    htslib-1.10.2

    samtools-1.10

    picard-2.22.0-1

    GATK 4.1.5.0

    I performed alignment for three independent fastq files per sample, duplicate mark for the three files together (to merge them), and basequalityrecalibration without issues. Then, haplotype caller:

    gatk --java-options "-Xmx28G" HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I my sample.merged.dm.recalibrated.bam -O mysample.gvcf.gz --dbsnp dbsnp_138.hg38.vcf.gz -ERC GVCF -L hg38_capture_targets.bed -ip 100 --bamout mysample.merged.haplotypecaller.bam

    This is the error message:

    A USER ERROR has occurred: Read J00148:222:HGG27BBXY:3:2121:11028:17509 chr3:142496281-142496374 is malformed: read starts with deletion. Cigar: 10D84M17H. Although the SAM spec technically permits such reads, this is often indicative of malformed files.

    and this is the read:

    J00148:222:HGG27BBXY:3:2121:11028:17509 163 chr3 142496191 60 101M = 142496291 201 GAAGAAAAGTGATAGAAAAGGAAGGTTCTTAATTTGGATTAGTTATGTTCTCTCAGTTGGCCTTTTAAGGAAGTACATAATTCCCGACATTATATATATAT ADDGDECEHBIC:(B;EEEHGDED7A?AF@CFC?,GG7<@?E=@<B:?93EEAACF?CJ,;=-+:6CEBCCB>BC:9<CFCEE0D>DDDC=BCBCCAB7BB MC:Z:101M MD:Z:88T0A11 PG:Z:MarkDuplicates RG:Z:HBM9c NM:i:2 AS:i:91XS:i:0
    J00148:222:HGG27BBXY:3:2121:11028:17509 83 chr3 142496291 60 101M = 142496191 -201 TATATATATATATATATATATATATATATATATATATATATATATCTGATGACATTTCCCTGGCCATTACCTTGGACCAGAGCCACTTTGCCCTTTCCACG BBCCC@AC@CAA*CCACBCC=?BBA@BC>C6CBB?CBCCCCABABI>6AEF<JCEDCC8DCA5DICFB0FGD?@ECH=DC9>GIB?AEDEGGHE;@FFB=A MC:Z:101M MD:Z:45A55 PG:Z:MarkDuplicates RG:Z:HBM9c NM:i:1 AS:i:96XS:i:50

    For the other file:

    A USER ERROR has occurred: Read K00310:319:HGFFJBBXY:8:1221:1753:5376 chrX:47144398-47144445 is malformed: read ends with deletion. Cigar: 45H45M11I3D. Although the SAM spec technically permits such reads, this is often indicative of malformed files.

    K00310:319:HGFFJBBXY:8:1221:1753:5376 99 chrX 47144353 60 90M4D11M = 47144447 161 GCCGGCCGCGCAGTCCTCTACTGCCTAAGAACGTCCTCTACAGCGCCCCATCTGCCGATCCCGGACTGACCCCTTCGGGGATCCGTCCCCCCCCCCCCCCC A@@=ED@?B?EBH?EDDED>DDGEDD>CHAED@@EDDED>DBEE@EDDDB@EDF=D@64A8D>GADDI.8@D8:>>=GD6-6B@(;?@=@+3+4>*+,=D> MC:Z:34S67M MD:Z:80T9^ACTA11 PG:Z:MarkDuplicates RG:Z:HBM9mb NM:i:5 AS:i:86 XS:i:20
    K00310:319:HGFFJBBXY:8:1221:1753:5376 147 chrX 47144447 60 34S67M = 47144353 -161 CCCGCCCCCCCCCCTCGGGGTTCCCCCCCCCCCTCCCCCCCCCCCCCCCCCGCCTCTCACCTTCTCATACAAGTTTTCGTCCTCGGGTTCTGGGTCCTCTT B85/>B-,-CAG?--()9)<---,,,-6,=,5-,AGGFGGFGGFGGEGG=@EGHAG@I.EHEAGAG>;?F+<CEECA:CAGH;=DDCDAGBDCDAGFCFE; MC:Z:90M4D11M MD:Z:67 PG:Z:MarkDuplicates RG:Z:HBM9mb NM:i:0 AS:i:67 XS:i:29

    just in case it is of any use.

     

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi all, 

     

    We are working on fixing this issue. We are going to have the next release of GATK coming out in the next few days with this fix. Once it is released you will see it first here: https://github.com/broadinstitute/gatk/releases

    0
    Comment actions Permalink
  • Avatar
    Steve Laurie

    Dear Bhanu Gandham,

    I realise this is a challenging time for all, and I understand if the GATK developers are not working at full pace at the moment as we all have higher priorities. So this is just a friendly request to ask if you know if there will be a new release that will resolve the issues described above, as we have encountered the same fatal error.

    If the new release is likely to be delayed substantially, for quite understandable regions, could you recommend a prior 4.1x.x release for which you know that HaplotypeCaller was producing good results.

    Thanks a lot, practice social distancing, and stay safe!

    Steve (Spain)

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    Sue John WVNicholson Julien Soubrier Kelly Rabionet Steve Laurie I don't know when our engine team will release next, but here is a pre-release build in a public google bucket: gs://broad-dsde-methods-davidben/gatk-builds/gatk-4.1.5.1-pre-release.jar

    Please let me know if any issues remain, and please continue posting about bugs and giving us friendly reminders as needed — they're probably more needed these days!  I don't presume to speak for others, but while homeschooling three kids by day and working by night, one thing keeping me sane is being part of a professional team with pride in our work, virus or no virus.

    0
    Comment actions Permalink
  • Avatar
    Steve Laurie

    Hi David Benjamin

    Thanks for the rapid reply. I would like to test your pre-release.jar, but the URL isn't correct. I have tried to work out what it should be, without any success. Perhaps it isn't public?

    It looks like there may be a permissions issue.

    Thanks,

    Steve

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    Steve Laurie It's a google bucket path, not a URL.  You can copy it locally with gsutil cp gs://______ <local path>

    0
    Comment actions Permalink
  • Avatar
    Steve Laurie

    Dear David Benjamin

    Thanks a lot for the reply. It's been a while since I accessed a google bucket so I didn't remember, sorry.

    For now, as we want something that will be robust, but we are in a bit of a hurry, we are going to benchmark against NA12878 from NIST using 4.1.4.1 and 4.1.4.0.

    Thanks for your help, and keep up the good work - we all really appreciate it!

    All the best, Steve

    0
    Comment actions Permalink
  • Avatar
    Steve Laurie

    Dear all,

    I see there has been a new release that hopefully resolves the HC bug. I will be testing over the next couple of days and will respond with results on this thread.

    Thanks a lot to the Developers for their ongoing work on this key tool. Hope all reading are in good health and respecting social distancing. Look after you and yours.

    Steve

    0
    Comment actions Permalink
  • Avatar
    Julien Soubrier

    Hi all,

    Trying the new release overnight, I re-ran the exact same command on the same data set and it completed with no error for the 11 exomes.

    Thanks a lot David Benjamin, Bhanu Gandham and the rest of the team for implementing a patch so quickly.

     

    Stay safe everyone.

    Cheers from down under.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk