Collecting alignment metrics at READ_GROUP level with Picard
I am trying to run the Picard CollectMultipleMetrics command on a CRAM file to obtain alignment metrics at a read group level, using the accumulation level flag and the alignment metrics program:
-METRIC_ACCUMULATION_LEVEL READ_GROUP
-PROGRAM CollectAlignmentSummaryMetrics
The CRAM was aligned to the hg38 reference from multiple read groups. The different read groups can be seen in the CRAM header:
samtools view -H sample123.cram
...
@RG ID:sample123 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-28923D42 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3CEDF6CF LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-5885F29A LB:LB0 PL:PL0 PU:PU0 SM:sample123
...
I was expecting that the metrics would be accumulated for each RG ID present in the header.
However, it seems that the metrics are accumulated based on the PU (Platform Unit?) of each read group. Since each group has the same PU - "PU0" - the metrics for all read groups are being collapsed together and compressed into just one set of metrics, for "PU0" read groups.
See below for the output. To me, it looks like the "PU" field is being used as the "READ_GROUP" ID.
Should I expect to see alignment metrics for each RG ID in the header, or is it normal for the program to group reads based on the PU field? If I did want to get alignment metrics for each different read group that went into the CRAM, how can this be done?
## htsjdk.samtools.metrics.StringHeader
# CollectMultipleMetrics INPUT=/io/batch/4e6e78/sample123.cram ASSUME_SORTED=true OUTPUT=/io/batch/4e6e78/prefix METRIC_ACCUMULATION_LEVEL=[READ_GROUP] PROGRAM=[CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, MeanQualityByCycle, CollectBaseDistributionByCycle, CollectQualityYieldMetrics] VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=/io/batch/4e6e78/inputs/0v8nr/Homo_sapiens_assembly38_masked.fasta STOP_AFTER=0 INCLUDE_UNPAIRED=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Sep 20 02:06:35 UTC 2024
## METRICS CLASS picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH SD_READ_LENGTH MEDIAN_READ_LENGTH MAD_READ_LENGTH MIN_READ_LENGTH MAX_READ_LENGTH MEAN_ALIGNED_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS PF_READS_IMPROPER_PAIRS PCT_PF_READS_IMPROPER_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER PCT_SOFTCLIP PCT_HARDCLIP AVG_POS_3PRIME_SOFTCLIP_LENGTH SAMPLE LIBRARY READ_GROUP
FIRST_OF_PAIR 244114554 244114554 1 0 243578766 0.997805 35830086505 223433790 32942266262 32484341761 0 0.003247 0.002252 0.00024 148 0 148 0 148 148 146.822344 242498638 0.995566 5273335 0.021649 0 0.50016 0.018886 0.000021 0.005762 0 36.276036 sample123 LB0 PU0
SECOND_OF_PAIR 244114554 244114554 1 0 242514944 0.993447 35411909816 221476505 32535845956 31197346891 0 0.005488 0.00445 0.00023 148 0 148 0 148 148 145.106691 242498638 0.999933 4209513 0.017358 0 0.499727 0.019563 0.000001 0.012997 0 40.307926 sample123 LB0 PU0
PAIR 488229108 488229108 1 0 486093710 0.995626 71241996321 444910295 65478112218 63681688652 0 0.004361 0.003344 0.000235 148 0 148 0 148 148 145.964517 484997276 0.997744 9482848 0.019508 0 0.499944 0.019224 0.000011 0.009379 0 38.920755 sample123 LB0 PU0
REQUIRED for all errors and issues:
a) GATK version used: 4.6.0 / Picard 3.2.0
b) Exact command used:
CollectMultipleMetrics -INPUT /io/sample123.cram -REFERENCE_SEQUENCE /io/Homo_sapiens_assembly38_masked.fasta -OUTPUT /io/prefix -ASSUME_SORTED True -PROGRAM null -VALIDATION_STRINGENCY SILENT -PROGRAM CollectAlignmentSummaryMetrics -PROGRAM CollectInsertSizeMetrics -PROGRAM MeanQualityByCycle -PROGRAM CollectBaseDistributionByCycle -PROGRAM CollectQualityYieldMetrics -METRIC_ACCUMULATION_LEVEL null -METRIC_ACCUMULATION_LEVEL READ_GROUP
c) Entire program log (Filled with ProcessExecutor warnings but I'm not sure they're relevant)
INFO 2024-09-20 02:06:35 CollectMultipleMetrics
INFO 2024-09-20 02:06:35 RExecutor Executing R script via command: Rscript /tmp/script16343948897468350944.R
ERROR 2024-09-20 02:06:35 ProcessExecutor During startup - Warning messages:
ERROR 2024-09-20 02:06:35 ProcessExecutor 1: Setting LC_CTYPE failed, using "C"
ERROR 2024-09-20 02:06:35 ProcessExecutor 2: Setting LC_COLLATE failed, using "C"
ERROR 2024-09-20 02:06:35 ProcessExecutor 3: Setting LC_TIME failed, using "C"
ERROR 2024-09-20 02:06:35 ProcessExecutor 4: Setting LC_MESSAGES failed, using "C"
ERROR 2024-09-20 02:06:35 ProcessExecutor 5: Setting LC_MONETARY failed, using "C"
ERROR 2024-09-20 02:06:35 ProcessExecutor 6: Setting LC_PAPER failed, using "C"
ERROR 2024-09-20 02:06:35 ProcessExecutor 7: Setting LC_MEASUREMENT failed, using "C"
INFO 2024-09-20 02:06:35 ProcessExecutor [1] "Checking if R is installed"
02:06:35.645 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/root/micromamba/share/picard-3.2.0-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Sep 20 02:06:35 UTC 2024] CollectMultipleMetrics INPUT=/io/batch/4e6e78/sample123.cram ASSUME_SORTED=true OUTPUT=/io/batch/4e6e78/prefix METRIC_ACCUMULATION_LEVEL=[READ_GROUP] PROGRAM=[CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, MeanQualityByCycle, CollectBaseDistributionByCycle, CollectQualityYieldMetrics] VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=/io/batch/4e6e78/inputs/0v8nr/Homo_sapiens_assembly38_masked.fasta STOP_AFTER=0 INCLUDE_UNPAIRED=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Sep 20 02:06:35 UTC 2024] Executing as root@hostname-ec03d77033 on Linux 6.5.0-1023-gcp amd64; OpenJDK 64-Bit Server VM 22.0.1-internal-adhoc.conda.src; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 3.2.0-1-g3948afb6b
WARNING 2024-09-20 02:06:35 CollectMultipleMetrics The MeanQualityByCycle program does not support a metric accumulation level, but METRIC_ACCUMULATION_LEVEL was overridden in the command line. MeanQualityByCycle will be run against the entire input.
WARNING 2024-09-20 02:06:35 CollectMultipleMetrics The CollectBaseDistributionByCycle program does not support a metric accumulation level, but METRIC_ACCUMULATION_LEVEL was overridden in the command line. CollectBaseDistributionByCycle will be run against the entire input.
WARNING 2024-09-20 02:06:35 CollectMultipleMetrics The CollectQualityYieldMetrics program does not support a metric accumulation level, but METRIC_ACCUMULATION_LEVEL was overridden in the command line. CollectQualityYieldMetrics will be run against the entire input.
ERROR 2024-09-20 02:06:36 CollectAlignmentSummaryMetrics ReadLength histogram is calculated on all reads only, but ALL_READS were not included in the Metric Accumulation Levels. Histogram will not be generated.
INFO 2024-09-20 02:07:02 SinglePassSamProgram Processed 1,000,000 records. Elapsed time: 00:00:25s. Time for last 1,000,000: 22s. Last read position: chr1:7,174,235
INFO 2024-09-20 02:07:23 SinglePassSamProgram Processed 2,000,000 records. Elapsed time: 00:00:46s. Time for last 1,000,000: 20s. Last read position: chr1:13,807,623
...
INFO 2024-09-20 03:57:47 SinglePassSamProgram Processed 489,000,000 records. Elapsed time: 01:51:10s. Time for last 1,000,000: 19s. Last read position: chrUn_JTFH01001929v1_decoy:599
INFO 2024-09-20 03:58:07 SinglePassSamProgram Processed 490,000,000 records. Elapsed time: 01:51:31s. Time for last 1,000,000: 20s. Last read position: */*
INFO 2024-09-20 03:59:01 RExecutor Executing R script via command: Rscript /tmp/script9948235460342383020.R /io/batch/4e6e78/prefix.insert_size_metrics /io/batch/4e6e78/prefix.insert_size_histogram.pdf sample123.cram
ERROR 2024-09-20 03:59:01 ProcessExecutor During startup - Warning messages:
ERROR 2024-09-20 03:59:01 ProcessExecutor 1: Setting LC_CTYPE failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 2: Setting LC_COLLATE failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 3: Setting LC_TIME failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 4: Setting LC_MESSAGES failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 5: Setting LC_MONETARY failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 6: Setting LC_PAPER failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 7: Setting LC_MEASUREMENT failed, using "C"
INFO 2024-09-20 03:59:01 ProcessExecutor null device
INFO 2024-09-20 03:59:01 ProcessExecutor 1
INFO 2024-09-20 03:59:01 RExecutor Executing R script via command: Rscript /tmp/script1233926558258630103.R /io/batch/4e6e78/prefix.quality_by_cycle_metrics /io/batch/4e6e78/prefix.quality_by_cycle.pdf sample123.cram
ERROR 2024-09-20 03:59:01 ProcessExecutor During startup - Warning messages:
ERROR 2024-09-20 03:59:01 ProcessExecutor 1: Setting LC_CTYPE failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 2: Setting LC_COLLATE failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 3: Setting LC_TIME failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 4: Setting LC_MESSAGES failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 5: Setting LC_MONETARY failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 6: Setting LC_PAPER failed, using "C"
ERROR 2024-09-20 03:59:01 ProcessExecutor 7: Setting LC_MEASUREMENT failed, using "C"
INFO 2024-09-20 03:59:01 ProcessExecutor null device
INFO 2024-09-20 03:59:01 ProcessExecutor 1
INFO 2024-09-20 03:59:02 RExecutor Executing R script via command: Rscript /tmp/script16334611869075646450.R /io/batch/4e6e78/prefix.base_distribution_by_cycle_metrics /io/batch/4e6e78/prefix.base_distribution_by_cycle.pdf sample123.cram
ERROR 2024-09-20 03:59:02 ProcessExecutor During startup - Warning messages:
ERROR 2024-09-20 03:59:02 ProcessExecutor 1: Setting LC_CTYPE failed, using "C"
ERROR 2024-09-20 03:59:02 ProcessExecutor 2: Setting LC_COLLATE failed, using "C"
ERROR 2024-09-20 03:59:02 ProcessExecutor 3: Setting LC_TIME failed, using "C"
ERROR 2024-09-20 03:59:02 ProcessExecutor 4: Setting LC_MESSAGES failed, using "C"
ERROR 2024-09-20 03:59:02 ProcessExecutor 5: Setting LC_MONETARY failed, using "C"
ERROR 2024-09-20 03:59:02 ProcessExecutor 6: Setting LC_PAPER failed, using "C"
ERROR 2024-09-20 03:59:02 ProcessExecutor 7: Setting LC_MEASUREMENT failed, using "C"
INFO 2024-09-20 03:59:02 ProcessExecutor null device
INFO 2024-09-20 03:59:02 ProcessExecutor 1
[Fri Sep 20 03:59:02 UTC 2024] picard.analysis.CollectMultipleMetrics done. Elapsed time: 112.45 minutes.
Runtime.totalMemory()=6501171200
-
Can you share the @RG sections of your file header with us?
-
Gökalp Çelik sure - this aligned genome is comprised of 76 read groups
@RG ID:sample123 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-28923D42 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3CEDF6CF LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-5885F29A LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3520DFFE LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-366A9FB8 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-4F7A0D5C LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-115C8711 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-18F99BFC LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-120B2C3F LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3E8D6E03 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-4AD9D46B LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-1E7884DB LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-757E36E4 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-62FBF36A LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-1F664086 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-730D5BF9 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-67B3974F LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-72E1B077 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3B8C00F6 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-27649607 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-2FBB9824 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-593E25C1 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-583E3F74 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3DE454D0 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-321B27FF LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-17E95D28 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-782072ED LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6ED91621 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-7D479A11 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6B8118C LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-2599CB31 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-156E0DC9 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6C66E462 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-497B28A2 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3A1A2D3F LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-55AB40C8 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-2E3DE9AB LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6FB3E4EB LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-505DDE31 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-1877978D LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6D60FCD7 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-2C7955C7 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-12529AD1 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-46273D5F LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3FE36C1F LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-4944ABAE LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-7C2479C0 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-578EC2E6 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-278981C9 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-5F493686 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-338A9647 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-40999CB LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-75850883 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-42C1F418 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-754B21C4 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-317E9F82 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-11A70168 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-A59B4A5 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-39BC19BB LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-414F8D31 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-5E51D997 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-68A3FB9E LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-2861807E LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-1F6842B1 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6FF60D74 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-7AA3E279 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-3D539683 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-4A7102B9 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-361FC785 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-56BB63A6 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-196F3249 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-2E9AD623 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-38B6A8F1 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-B3093A9 LB:LB0 PL:PL0 PU:PU0 SM:sample123
@RG ID:sample123-6F17F4EE LB:LB0 PL:PL0 PU:PU0 SM:sample123 -
Hi again.
When we checked the SAM spec and the requirements for a RG tag it looks like there can be 2 separate fields that are supposed to be unique per read group. One is ID and the other one is PU. However SAM Spec also mentiones that ID can be modified to prevent collisions within sample during merging but two samples may also share the same ID tag therefore even 2 samples are different RG IDs could be the same. PU on the other hand is a unique field and is never touched during any SAM/BAM operation. Due to this rationale PU is selected as a way to distinguish between read groups within the code. Since all your PU values are the same metric collection cannot distinguish between different read groups and collects all metrics for a single PU.
If you wish to get different metrics for each read group we suggest you to modify your RG tags using AddOrReplaceReadGroups tool.
I hope this helps.
Regards.
-
This has been very helpful, thank you very much Gökalp Çelik!
Ed
Please sign in to leave a comment.
4 comments