INVALID_CIGAR error after ClipOverlap
When processing paired-end low-coverage historical whole genome sequences, I have run into an error (INVALID_CIGAR: No real operator (M|I|D|N) in CIGAR). This error only appears after BamUtil ClipOverlap, but I have been unable to figure out the source or the solution. I am new to bioinformatics and genomics so any guidance would be most appreciated.
The pipeline I have been following is:
- Trimming with TrimGalore! and CutAdapt
- Mapping, filtering, and sorting with Bowtie2, SamTools view, and SamTools sort
- Deduplication and clip with Picard MarkDuplicates and BamUtil ClipOverlap
The exact command used is:
for SAMPLEBAM in
cat $BAMLIST
; dojava -Xmx60g -jar $PICARD MarkDuplicates I=$BASEDIR'/bam/'$SAMPLEBAM'bt2'$REFNAME'_minq20_sorted.bam' O=$BASEDIR'/bam/'$SAMPLEBAM'bt2'$REFNAME'_minq20_sorted_dedup.bam' M=$BASEDIR'/bam/'$SAMPLEBAM'bt2'$REFNAME'_minq20_sorted_dupstat.txt'
VALIDATION_STRINGENCY=SILENT
REMOVE_DUPLICATES=true$BAMUTIL clipOverlap --in $BASEDIR'/bam/'$SAMPLEBAM'bt2'$REFNAME'_minq20_sorted_dedup.bam' --out $BASEDIR'/bam/'$SAMPLEBAM'bt2'$REFNAME'_minq20_sorted_dedup_overlapclipped.bam'
--stats
done
After this step, ValidateSamFile in summary mode for the first sample returns:
ERROR: INVALID_CIGAR 4726347
ERROR: MATE_NOT_FOUND 421257
ERROR: MISSING_READ_GROUP 1
WARNING: RECORD_MISSING_READ_GROUP 10160741
Verbose mode on ValidateSamFile outputs the following error (with differences for read name and record number) until it reaches the 100 error limit:
ERROR: INVALID_CIGAR: Record 38, Read name A00437:550:HN7CLDSX3:1:1553:19425:21386, No real operator (M|I|D|N) in CIGAR
I have attempted to fix this by using Picard FixMateInformation and Picard AddOrReplaceGroups. These remove the missing read group errors, but introduce CIGAR_MAPS_OFF_REFERENCE and MSIMATCH_MATE_CIGAR_STRING errors, and do not fix the MATE_NOT_FOUND error.
Picard version 2.27.5-4-g5295289-SNAPSHOT
openjdk version 1.8.0_432
gatk-4.3.0.0
BamUtil Version 1.0.15
From ValidateSamFile Summary mode:
From verbose mode:
-
We cannot give any recommendations on BamUtil however let me ask you a few questions to track this issue.
1- Do you observe this kind of error messages before BamUtil step?
2- Is there a particular reason why you would like to Clip overlapping reads? Many of our tools can handle read overlaps properly and there is no need to clip those reads.
Regards
-
Hi Gökalp Çelik. Thanks for your response!
To answer the questions:
1- The CIGAR issue is not observed before the BamUtil step, but the "ERROR: MATE_NOT_FOUND 421257" is. Running AddOrReplaceGroups resolved read group errors and warnings.
2- ClipOverlap was part of the tutorial I used to guide my pipeline (https://github.com/nt246/lcwgs-guide-tutorial). To be honest, skipping this step would have saved me a headache.
Much appreciated
-
Hi again.
Looks like you can skip that step since many contemporary variant callers have proper algorithms to handle overlapping basepairs from paired reads and these methods are usually on by default i.e. HaplotypeCaller.
Regards.
Please sign in to leave a comment.
3 comments