picard CompareSAMs
AnsweredHi! I recently compared BAMs created with two slightly different version of a workflow and used picard CompareSAMs to compare the outputs. In my output tsv all *_DIFFER and MISSING_* are 0 yet ARE_EQUAL is N. I would expect if none of the mappings or duplicate markings differ and no alignments are missing from either file, then ARE_EQUAL would be Y. The only thing I can think of is that the read group ID is different in the two files. And one pipeline has more commands in the header. Both of these differences are reflected in the standard out:
Read Group ID differs.
File 1: JDPHV_S1_L001_aa
File 2: JDPHV_S1_L001
Number of program records differs.
File 1: 2
File 2: 4
If that is the only difference then I think maybe this is a bug report. The reason is that I'm imaging the main purpose of this tool is to compare SAM or BAM files from two different pipelines or sequencing runs and especially the latter would likely have a different read group ID. When one compares alignment files are they not mostly concerned with how comparable the actual alignments are?
Is there something else I am missing?
Many thanks!
Matt
REQUIRED for all errors and issues:
a) GATK version used:
gatk4-4.2.6.1
picard-2.23.4
b) Exact command used:
picard CompareSAMs old/DS-376302_old.hg19.bam new/DS-376302.hg19.bam O=CompareSams/DS-376302.tsv
c) Entire program log:
picard CompareSAMs old/DS-376302_old.hg19.bam new/DS-376302.hg19.bam O=CompareSams/DS-376302.tsv
INFO 2022-05-17 13:44:20 CompareSAMs
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** CompareSAMs old/DS-376302_old.hg19.bam new/DS-376302.hg19.bam -O CompareSams/DS-376302.tsv
**********
13:44:21.236 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/linuxbrew/.linuxbrew/Cellar/picard-tools/2.23.4/libexec/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue May 17 13:44:21 PDT 2022] CompareSAMs old/DS-376302_old.hg19.bam new/DS-376302.hg19.bam OUTPUT=CompareSams/DS-376302.tsv LENIENT_HEADER=false LENIENT_DUP=false LENIENT_LOW_MQ_ALIGNMENT=false LENIENT_UNKNOWN_MQ_ALIGNMENT=false LOW_MQ_THRESHOLD=3 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue May 17 13:44:21 PDT 2022] Executing as msnyder5@seapl0app024 on Linux 3.8.13-118.2.2.el6uek.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_242-b08; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.23.4
Read Group ID differs.
File 1: JDPHV_S1_L001_aa
File 2: JDPHV_S1_L001
Number of program records differs.
File 1: 2
File 2: 4
SAM files differ.
[Tue May 17 13:44:46 PDT 2022] picard.sam.CompareSAMs done. Elapsed time: 0.43 minutes.
Runtime.totalMemory()=1907884032
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
-
Thank you for your post, Matt Snyder! I want to let you know we have received your question and will be moving it to the Community Discussions -> General Discussion topic, as the Other topic is for reporting bugs and issues with GATK.
We'll get back to you if we have any updates or follow up questions. Please see our Support Policy for more details about how we prioritize responding to questions.
-
Hi Matt Snyder,
I'm wondering if you have tried or are aware of the --LENIENT_HEADER option in CompareSAMs? This option allows for the headers to be different in certain fields.
Let me know if that helps with your comparisons!
Best,
Genevieve
-
Hi Genevieve Brandt (she/her),
Sorry, I know this issue is ancient. I did use `--LENIENT_HEADER true` and just re-tried with Version:2.27.4-SNAPSHOT and got the same result. MAPPINGS_DIFFER, UNMAPPED_LEFT, UNMAPPED_RIGHT, MISSING_LEFT, MISSING_RIGHT, and DUPLICATE_MARKINGS_DIFFER are all zero, but ARE_EQUAL is "N". When I use the --LENIENT_HEADER argument, I no longer see "Number of program records differs" in the output log, but I still see "Read Group ID differs". Since the read group is present in the header as well as the body of the BAM, I wonder if this is causing the issue... So maybe this is really a feature request for a --LENIENT_READ_GROUP argument? Again, as I stated before, when we compare SAM/BAM files, I think we are most often concerned with the difference in the mappings, which is reflected in the output table, and often not as interested in read groups.
Thanks!
Please sign in to leave a comment.
3 comments