Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

ApplyBQSR to unmapped reads for non-model organism does not work with '-L unmapped' flag

0

8 comments

  • Avatar
    Genevieve Brandt

    Hello cali, can you look at your BAM file header and the FASTA dict file and confirm that the @SQ lines in your bam file match your dict file?

    0
    Comment actions Permalink
  • Avatar
    cali

    Hi Genevieve Brandt

    Thanks for your reply. Only the header differs:

    $ sdiff -s bam_headers.txt dict_first3cols.txt 
    @HD VN:1.3 SO:coordinate | @HD VN:1.4 SO:unsorted

    I would be surprised if this was the cause - particularly because the interval flag works fine for the contigs. 

    Many thanks,

    Cali 

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt

    cali could you validate your BAM file following these instructions

    0
    Comment actions Permalink
  • Avatar
    cali

    Hi Genevieve Brandt

    ValidateSamFile in summary mode shows the following:

    Error Type Count
    ERROR:INVALID_INSERT_SIZE 134030
    WARNING:REF_SEQ_TOO_LONG_FOR_BAI 1

    And with "IGNORE_WARNINGS" and verbose mode shows the first 100 events of "Insert size out of range" then exits, eg

    ERROR: Record 4791, Read name ST-E00180:231:H7CW3ALXX:2:2124:6705:68482, Insert size out of range

    Investigating the first record shows that the error is thrown by GATK disliking the very large chomosomes of this species (chr 1 is 716 Mb):

    ST-E00180:231:H7CW3ALXX:2:2124:6705:68482 435 LR735554.1 10444 5 15H40M95H = 652183949 652173615 GGGGGTGGGGGGAAACAGGGAAAAAATTGGAAAAAAAGGT ))7)-))))7----))7----A7-----7----7F-77AA NM:i:2 MD:Z:20G11C7 MC:Z:149M AS:i:30 XS:i:27 RG:Z:H7CW3ALXX.2_659BigJo_1 SA:Z:LR735554.1,652183849,+,21M1D60M69S,37,2; XA:Z:LR735560.1,+1334233,103S32M15S,1;

    Clearly this is not a BAM format error as an inferred insert size of this scale is biologically possible for this species (though probably a mapping error in this case). Is it safe to ignore these warnings? Can this be fixed in future GATK releases?

    Many thanks,

    Cali 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt

    Hi cali, I will look at this with my team and let you know when I have more information.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt

    Hi cali,

    The issue with ValidateSamFile does not seem to be related to your BQSR issue. PICARD has been updated since the version you are using to allow for a larger insert size. The INVALID_INSERT_SIZE error should disappear. Our current GATK version is 4.1.8.1.

    For the above issue with BQSR, I have some follow up questions:

    • What tool did you use to produce these bams?
    • If you use samtools to extract the unmapped only reads, do you get a warning? It may be something like: [W::sam_parse1] urecognized reference name; treated as unmapped
    • How many contigs are in the reference?
    0
    Comment actions Permalink
  • Avatar
    cali

    Hi Genevieve Brandt

    • The BAMs were aligned with BWA-mem v 0.7.17, sorted and indexed (CSI) with SAMtools v 1.10, BAMs merged to per-sample BAM with SAMbamba v 0.7.1, duplicate reads marked with SAMblaster v 0.1.24. 
    • There are no errors when the unmapped reads are extracted with samtools.  Extracting out the f12 unmapped reads to a new BAM with samtools and running ApplyBQSR without the '-L unmapped' flag results in no error and a recalibrated unmapped BAM file.
    • There are  105 contigs in the reference
    • I just tested the recalibration using '-L unmapped' with 4.1.8.1 and the same error was produced as per 4.1.2.0 and 4.1.4.0.

    Kind regards,

    Cali 

     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt

    Hi cali, could you check these things?

    • Look for a read in your BAM file where the contig name is '187393'. You can use PrintReads -L 187393. 
    • The sequence dictionary or BAM index files may be corrupt. Re-create the sequence dictionary with CreateSequenceDictionary. Re-create the BAM index with IndexFeatureFile
    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk