GATK/Picard does not detect mates in paired-end BAM
Hi everyone,
I'm trying to revert a WGS hg38 BAM file to fastq, so I can filter the reads on quality and re-run it through our hg38 mapping pipeline. The data is a subset of the AML TARGET database. However, I get the following error when I try running it on all MT reads:
java -Djava.io.tmpdir=$TMPDIR -jar $PICARD ValidateSamFile \
I=/.../cromwell-executions/MitochondriaPipeline/f95acfa7-99bd-4a20-9cff-430042630083/call-SubsetBamToChrM/execution/PARHRSBMPC_dedup.bam \
MODE=SUMMARY \
Results:
HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_VERSION_NUMBER 1
ERROR:MATE_NOT_FOUND 304414
However, the reads are paired, as can be detected by IGV and seen in the FLAG (I've deleted the sequence):
SRR1168035.161988987.1 83 MT 1178 40 100M = 1035 -243 DDDDDEEDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEDCBFHJJIGEHFGIGBF@HD?IHCIIJJIIJJJJJJJJJGHJIIJHHHHHFDDA4#CC NM:i:1 AS:i:98 XS:i:98 RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001
SRR1168035.161988987.2 163 MT 1035 40 100M = 1178 243 @CCFFFFFHHHHHJGIIJJJEHIGIGIEGIJJGGIGEIIGIJJJD#####00<FHI#.;DEHGGHHHE@BC############################# NM:i:14 AS:i:72 XS:i:72 RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001
module load picardtools/2.20.1
module load bwa/0.7.13
set -o pipefail
set -e
bash_ref_fasta=/.../Homo_sapiens_assembly38.chrM.fasta
java -Dpicard.useLegacyParser=false -jar $PICARD \
SamToFastq \
--INPUT /.../call-RevertSam/execution/PAMYMABMNF_dedup.bam \
--FASTQ /dev/stdout \
--INTERLEAVE true \
--NON_PF true | \
bwa mem -K 100000000 -p -v 3 -t 2 -Y $bash_ref_fasta /dev/stdin - 2> >(tee PAMYMABMNF_dedup.realigned.bwa.stderr.log >&2) | \
java -Dpicard.useLegacyParser=false -Djava.io.tmpdir=$TMPDIR -jar $PICARD \
MergeBamAlignment \
-VALIDATION_STRINGENCY SILENT \
-EXPECTED_ORIENTATIONS FR \
-ATTRIBUTES_TO_RETAIN X0 \
-ATTRIBUTES_TO_REMOVE NM \
-ATTRIBUTES_TO_REMOVE MD \
--ALIGNED_BAM /dev/stdin \
--UNMAPPED_BAM /MitochondriaPipeline/ee57c6c6-6792-46b5-8760-9e023f657b47/call-RevertSam/execution/PAMYMABMNF_dedup.bam \
--OUTPUT mba.bam \
--REFERENCE_SEQUENCE /.../Homo_sapiens_assembly38.chrM.fasta \
--PAIRED_RUN true \
--SORT_ORDER "unsorted" \
--IS_BISULFITE_SEQUENCE false \
--ALIGNED_READS_ONLY false \
--CLIP_ADAPTERS false \
--MAX_RECORDS_IN_RAM 2000000 \
--ADD_MATE_CIGAR true \
--MAX_INSERTIONS_OR_DELETIONS -1 \
--PRIMARY_ALIGNMENT_STRATEGY MostDistant \
--PROGRAM_RECORD_ID "bwamem" \
--PROGRAM_GROUP_VERSION "bwa/0.7.13" \
--PROGRAM_GROUP_COMMAND_LINE "bwa mem -K 100000000 -p -v 3 -t 2 -Y $bash_ref_fasta" \
--PROGRAM_GROUP_NAME "bwamem" \
--UNMAPPED_READ_STRATEGY COPY_TO_TAG \
--ALIGNER_PROPER_PAIR_FLAGS true \
--UNMAP_CONTAMINANT_READS true \
--ADD_PG_TAG_TO_READS false
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Found 304414 unpaired mates
at htsjdk.samtools.SAMUtils.processValidationError(SAMUtils.java:467)
at picard.sam.SamToFastq.doWork(SamToFastq.java:211)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
Exception in thread "main" picard.PicardException: Second read from pair not found in unmapped bam: SRR1168035.161988987.1, SRR1168035.161988987.2
at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:411)
at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:181)
at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:356)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
What are some additional steps I can undertake to detect the mates of the paired-end reads? I have already tried FixMateInformation with the following command:
java -jar $PICARD FixMateInformation \
I=/.../PAMYMABMNF_dedup.bam \
ADD_MATE_CIGAR=true
Thanks in advance!
-
Hi,
Take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035891231
-
Just for clarity sake, I will post the answer to my issue:
After scouring the internet, I realised that BAM/SAM lines do not contain reads but alignments (kind of a brain fart) and that paired end alignment names need to have the same query_name to be seen by processing software as being paired.
Therefore I created a simple pysam script to remove the .1s and .2s from the reads. Tada, all lines are now recognised as properly paired reads!
-
Thanks for the update jipvdinter. This will be useful to other community members!
Please sign in to leave a comment.
3 comments