Merge Bam + Mark Duplicates breaks alignment
My input mapped bam has a general error rate of 0.02. After pushing this same bam through merge and mark, it breaks completely. If labeled and pushed directly into only Markdup, the error rate stays at this level. When using Merge and Mark, there are no errors and both processes complete, but the output bam has an error rate of 0.73 according to qualimap, and when I open it in IGV, it is utter unaligned chaos. I originally thought this was an issue of the aligner, but BWA MEM2, ngm and bowtie2 all produce fine alignments that seem to fall apart in the merge and mark. I think it is relevant that my data is shallow (~15x depth over genome 150 bp reads) and non-model organism.
What is going on? Validate sam says my uBAMs look fine.
REQUIRED for all errors and issues:
a) GATK version used: 4.2.6
b) Exact command used:
module load gatk
gatk --java-options "-Xmx54G" SortSam \
-I uBAMs/i224_snippet_fastqtosam.bam \
-O uBAMSORT/i224_ubam.bam \
--SORT_ORDER queryname
gatk --java-options "-Xmx54G" MergeBamAlignment -UNMAPPED i224_ubam.bam -ALIGNED Sort224mapQ.bam -R GCA_900291935.2_Cxam_T1-a_Genome_v2_genomic.fna -O 224ubamq_merge.bam
gatk MarkDuplicatesSpark -I 224ubamq_merge.bam -O 224ubamq_merge_mdup.bam -M 224ubamq_merge_marked_dup_metrics.txt
./qualimap bamqc -bam 224ubamq_merge_mdup.bam -oc 224ubamq_merge.file
c) Entire program log:
14:02:45.669 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/spack/opt/spack/linux-rhel8-icelake/gcc-8.4.1/gatk-4.2.6.1-zbtmus4wdwifjpu7hpmcfubgcyoa76xp/bin/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Feb 21 14:02:45 PST 2024] SortSam --INPUT /data/kadenmuffett/Genomes/uBAMs/i224_snippet_fastqtosam.bam --OUTPUT /data/kadenmuffett/Genomes/uBAMSORT/i224_ubam.bam --SORT_ORDER queryname --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Feb 21 14:02:45 PST 2024] Executing as kadenmuffett@node025.cluster on Linux 4.18.0-513.9.1.el8_9.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_392-b08; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.2.6.1
INFO 2024-02-21 14:03:13 SortSam Read 10,000,000 records. Elapsed time: 00:00:27s. Time for last 10,000,000: 27s. Last read position: */*
INFO 2024-02-21 14:03:38 SortSam Read 20,000,000 records. Elapsed time: 00:00:53s. Time for last 10,000,000: 25s. Last read position: */*
INFO 2024-02-21 14:04:03 SortSam Read 30,000,000 records. Elapsed time: 00:01:17s. Time for last 10,000,000: 24s. Last read position: */*
INFO 2024-02-21 14:04:28 SortSam Read 40,000,000 records. Elapsed time: 00:01:42s. Time for last 10,000,000: 25s. Last read position: */*
INFO 2024-02-21 14:04:53 SortSam Read 50,000,000 records. Elapsed time: 00:02:07s. Time for last 10,000,000: 24s. Last read position: */*
INFO 2024-02-21 14:04:54 SortSam Finished reading inputs, merging and writing to output now.
INFO 2024-02-21 14:05:10 SortSam Wrote 10,000,000 records from a sorting collection. Elapsed time: 00:02:24s. Time for last 10,000,000: 15s. Last read position: */*
INFO 2024-02-21 14:05:26 SortSam Wrote 20,000,000 records from a sorting collection. Elapsed time: 00:02:40s. Time for last 10,000,000: 15s. Last read position: */*
INFO 2024-02-21 14:05:42 SortSam Wrote 30,000,000 records from a sorting collection. Elapsed time: 00:02:56s. Time for last 10,000,000: 15s. Last read position: */*
INFO 2024-02-21 14:05:57 SortSam Wrote 40,000,000 records from a sorting collection. Elapsed time: 00:03:12s. Time for last 10,000,000: 15s. Last read position: */*
INFO 2024-02-21 14:06:13 SortSam Wrote 50,000,000 records from a sorting collection. Elapsed time: 00:03:27s. Time for last 10,000,000: 15s. Last read position: */*
[Wed Feb 21 14:06:14 PST 2024] picard.sam.SortSam done. Elapsed time: 3.48 minutes.
Runtime.totalMemory()=7464812544
Tool returned:
0
Using GATK jar /opt/spack/opt/spack/linux-rhel8-icelake/gcc-8.4.1/gatk-4.2.6.1-zbtmus4wdwifjpu7hpmcfubgcyoa76xp/bin/gatk-package-4.2.6.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx54G -jar /opt/spack/opt/spack/linux-rhel8-icelake/gcc-8.4.1/gatk-4.2.6.1-zbtmus4wdwifjpu7hpmcfubgcyoa76xp/bin/gatk-package-4.2.6.1-local.jar SortSam -I /data/kadenmuffett/Genomes/uBAMs/i224_snippet_fastqtosam.bam -O /data/kadenmuffett/Genomes/uBAMSORT/i224_ubam.bam --SORT_ORDER queryname
14:06:15.816 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/spack/opt/spack/linux-rhel8-icelake/gcc-8.4.1/gatk-4.2.6.1-zbtmus4wdwifjpu7hpmcfubgcyoa76xp/bin/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Feb 21 14:06:15 PST 2024] MergeBamAlignment --UNMAPPED_BAM /data/kadenmuffett/Genomes/uBAMSORT/i224_ubam.bam --ALIGNED_BAM /data/kadenmuffett/Genomes/mapBam/Sort224mapQ.bam --OUTPUT /scratch/kadenmuffett/bowtieplayb/224ubamq_merge.bam --REFERENCE_SEQUENCE /scratch/kadenmuffett/Cxam_refgen/ncbi_dataset/data/GCA_900291935.2/GCA_900291935.2_Cxam_T1-a_Genome_v2_genomic.fna --ADD_PG_TAG_TO_READS true --PAIRED_RUN true --CLIP_ADAPTERS true --IS_BISULFITE_SEQUENCE false --ALIGNED_READS_ONLY false --MAX_INSERTIONS_OR_DELETIONS 1 --ATTRIBUTES_TO_REVERSE OQ --ATTRIBUTES_TO_REVERSE U2 --ATTRIBUTES_TO_REVERSE_COMPLEMENT E2 --ATTRIBUTES_TO_REVERSE_COMPLEMENT SQ --READ1_TRIM 0 --READ2_TRIM 0 --ALIGNER_PROPER_PAIR_FLAGS false --SORT_ORDER coordinate --PRIMARY_ALIGNMENT_STRATEGY BestMapq --CLIP_OVERLAPPING_READS true --HARD_CLIP_OVERLAPPING_READS false --INCLUDE_SECONDARY_ALIGNMENTS true --ADD_MATE_CIGAR true --UNMAP_CONTAMINANT_READS false --MIN_UNCLIPPED_BASES 32 --MATCHING_DICTIONARY_TAGS M5 --MATCHING_DICTIONARY_TAGS LN --UNMAPPED_READ_STRATEGY DO_NOT_CHANGE --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Feb 21 14:06:15 PST 2024] Executing as kadenmuffett@node025.cluster on Linux 4.18.0-513.9.1.el8_9.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_392-b08; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.2.6.1
INFO 2024-02-21 14:06:15 SamAlignmentMerger Processing SAM file(s): [/data/kadenmuffett/Genomes/mapBam/Sort224mapQ.bam]
INFO 2024-02-21 14:06:16 AbstractAlignmentMerger Seen many non-increasing record positions. Printing Read-names as well.
INFO 2024-02-21 14:06:30 AbstractAlignmentMerger Merged 1,000,000 records. Elapsed time: 00:00:14s. Time for last 1,000,000: 13s. Last read position: OLMO02000301.1:9,152,813. Last read name: A01587:365:GW2304093045th:3:1123:20817:2190
INFO 2024-02-21 14:06:42 AbstractAlignmentMerger Merged 2,000,000 records. Elapsed time: 00:00:26s. Time for last 1,000,000: 12s. Last read position: OLMO02000297.1:4,484,653. Last read name: A01587:365:GW2304093045th:3:1145:30120:14857
INFO 2024-02-21 14:06:55 AbstractAlignmentMerger Merged 3,000,000 records. Elapsed time: 00:00:39s. Time for last 1,000,000: 12s. Last read position: OLMO02000006.1:4,037,915. Last read name: A01587:365:GW2304093045th:3:1168:10800:4288
(Continues)
-
Hi Kade Muffett
MergeBamAlignment tool restores hardclipped reads to softclipped status therefore it is normal to observe too many softclipped reads in IGV view when using this tool. Is this what you observe? Can you also elaborate on how you calculate the error rate?
Also our recommendation for this tool is to use it during or right after initial mapping stage. MarkDuplicates and any other steps should be used after merging the tagged unmapped bam with the aligned sam file.
MergeBamAlignment has the below major benefits
1- All reads stay as they are in the input fastq files and unmapped bases stay as softclipped. Produced bam file may be used to recreate fastq files close to their original form if needed.
2- Certain non-deterministic mappers (BWA) may mark any of the equally scoring pairs as primary whereas by changing this scoring strategy during MergeBamAlignment step you can determine which pair should be primary and which should go as supplementary.
3- Contaminating reads from microorganisms or pathogens can be cleaned up based on metrics already built in the tool therefore those reads will not clutter the original organisms reads.
4- Certain fastq tags can be transferred to aligned reads such as UMI or Adapter positions using this step.
5- Others that I cannot immediately think...
If you don't have any interest in any of these major points you may skip this step and continue as you please.
Regards.
-
The result does not appear to be restoration-related--the sequences no longer have any relation to the reference. The general error rate reported by qualimap is .73 and similarly the median mismatches reported in the picard tools quality check jumps to 107 per 150 bp read. Reads appear to have jumped places and retain their reported mapping quality score even though they have only accidental overlap with reference.
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH SD_READ_LENGTH MEDIAN_READ_LENGTH MAD_READ_LENGTH MIN_READ_LENGTH MAX_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS PF_READS_IMPROPER_PAIRS PCT_PF_READS_IMPROPER_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER PCT_SOFTCLIP PCT_HARDCLIP AVG_POS_3PRIME_SOFTCLIP_LENGTH FIRST_OF_PAIR 25251356 25251356 1 0 16474979 0.652439 2.46E+09 12191873 1.83E+09 1.79E+09 107 0.734622 0.73429 0.000864 150 0 150 0 150 150 15457979 0.93827 1259639 0.076458 0 0.499888 0.005304 0.000011 0.001108 0 9.030712 SECOND_OF_PAIR 25251356 25251356 1 0 16347984 0.64741 2.44E+09 12121342 1.82E+09 1.78E+09 107 0.734597 0.734272 0.000872 150 0 150 0 150 150 15457979 0.945559 1132644 0.069283 0 0.500307 0.005332 0.000095 0.001107 0 9.02534 PAIR 50502712 50502712 1 0 32822963 0.649925 4.91E+09 24313215 3.64E+09 3.57E+09 107 0.734609 0.734281 0.000868 150 0 150 0 150 150 30915958 0.9419 2392283 0.072884 0 0.500096 0.005318 0.000053 0.001108 0 9.028026 -
Hi Kade Muffett
I see that your command line uses the parameter below.
--PRIMARY_ALIGNMENT_STRATEGY BestMapq
Can you try with a different parameter such as
--PRIMARY_ALIGNMENT_STRATEGY MostDistant
which is also in our best practices.
Regards.
Please sign in to leave a comment.
3 comments