Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

AbstractOpticalDuplicateFinderCommandLineProgram Default READ_NAME_REGEX '<optimized capture of last three ':' separated fields as numeric values>' did not match read name '2hpf_wt_total_SRR870747.42096'.

Answered
11 comments

  • Genevieve Brandt (she/her)

    Hi Dina Tzur,

    Could you share the complete program log from MarkDuplicates?

    Thank you,

    Genevieve

  • Dina Tzur

    Hi Genevieve Brandt,

    Thanks for the attention!

    Can you explain to me how I find it?

    Thanks again!

    Dina

  • Genevieve Brandt (she/her)

    Yes! The program log is all the messages printed to your terminal when running the GATK command line.

    Here is another post showing the complete program log from CreateReadCountPanelOfNormals: https://gatk.broadinstitute.org/hc/en-us/community/posts/4417744373019-Error-while-running-CreateReadCountPanelOfNormals

  • Dina Tzur

    [bam_sort_core] merging from 10 files and 1 in-memory blocks...
    INFO    2022-02-16 21:09:14     MarkDuplicates

    ********** NOTE: Picard's command line syntax is changing.
    **********
    ********** For more information, please see:
    ********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
    **********
    ********** The command line looks like this in the new syntax:
    **********
    **********    MarkDuplicates -I /sci/home/dinnatzur12/group/BAM_z/BAM_Lee2013/bam_bai/2.0hpf_wt_total.star.bam.sort.bam -O /sci/home/dinnatzur12/group/BAM_z/BAM_Lee2013/md/2.0hpf_wt_total.star.bam.MD.bam -M /sci/home/dinnatzur12/group/BAM_z/BAM_Lee2013/md/2.0hpf_wt_total.star.bam.MD_matrix.txt
    **********


    21:09:15.667 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/hurcs/miniconda3/envs/picard-2.26.4/share/picard-2.26.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Wed Feb 16 21:09:15 IST 2022] MarkDuplicates INPUT=[/sci/home/dinnatzur12/group/BAM_z/BAM_Lee2013/bam_bai/2.0hpf_wt_total.star.bam.sort.bam] OUTPUT=/sci/home/dinnatzur12/group/BAM_z/BAM_Lee2013/md/2.0hpf_wt_total.star.bam.MD.bam METRICS_FILE=/sci/home/dinnatzur12/group/BAM_z/BAM_Lee2013/md/2.0hpf_wt_total.star.bam.MD_matrix.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Wed Feb 16 21:09:15 IST 2022] Executing as dinnatzur12@glacier-06 on Linux 5.10.79-aufs-1 amd64; OpenJDK 64-Bit Server VM 11.0.9.1-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.26.4
    INFO    2022-02-16 21:09:15     MarkDuplicates  Start of doWork freeMemory: 510616376; totalMemory: 519110656; maxMemory: 2075918336
    INFO    2022-02-16 21:09:15     MarkDuplicates  Reading input file and constructing read end information.
    INFO    2022-02-16 21:09:15     MarkDuplicates  Will retain up to 7521443 data points before spilling to disk.
    WARNING 2022-02-16 21:09:15     AbstractOpticalDuplicateFinderCommandLineProgram        Default READ_NAME_REGEX '<optimized capture of last three ':' separated fields as numeric values>' did not match read name '2hpf_wt_total_SRR870747.420964'.  You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates.  Note that this message will not be emitted again even if other read names do not match the regex.
    INFO    2022-02-16 21:09:19     MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:03s.  Time for last 1,000,000:    3s.  Last read position: chr4:55,758,217
    INFO    2022-02-16 21:09:19     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:21     MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:05s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,551,161
    INFO    2022-02-16 21:09:21     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:22     MarkDuplicates  Read     3,000,000 records.  Elapsed time: 00:00:07s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,557,063
    INFO    2022-02-16 21:09:22     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:24     MarkDuplicates  Read     4,000,000 records.  Elapsed time: 00:00:08s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,558,095
    INFO    2022-02-16 21:09:24     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:26     MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:10s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,558,806
    INFO    2022-02-16 21:09:26     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:27     MarkDuplicates  Read     6,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,559,415
    INFO    2022-02-16 21:09:27     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:29     MarkDuplicates  Read     7,000,000 records.  Elapsed time: 00:00:13s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,560,710
    INFO    2022-02-16 21:09:29     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:31     MarkDuplicates  Read     8,000,000 records.  Elapsed time: 00:00:15s.  Time for last 1,000,000:    1s.  Last read position: chr4:77,563,046
    INFO    2022-02-16 21:09:31     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:33     MarkDuplicates  Read     9,000,000 records.  Elapsed time: 00:00:17s.  Time for last 1,000,000:    1s.  Last read position: chr7:30,627,338
    INFO    2022-02-16 21:09:33     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:37     MarkDuplicates  Read    10,000,000 records.  Elapsed time: 00:00:21s.  Time for last 1,000,000:    4s.  Last read position: chr12:17,156,783
    INFO    2022-02-16 21:09:37     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:39     MarkDuplicates  Read    11,000,000 records.  Elapsed time: 00:00:23s.  Time for last 1,000,000:    2s.  Last read position: chr18:3,576,345
    INFO    2022-02-16 21:09:39     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:41     MarkDuplicates  Read    12,000,000 records.  Elapsed time: 00:00:25s.  Time for last 1,000,000:    1s.  Last read position: chr23:22,507,823
    INFO    2022-02-16 21:09:41     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:43     MarkDuplicates  Read    13,000,000 records.  Elapsed time: 00:00:27s.  Time for last 1,000,000:    1s.  Last read position: KZ115963.1:573
    INFO    2022-02-16 21:09:43     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:44     MarkDuplicates  Read    14,000,000 records.  Elapsed time: 00:00:28s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:437
    INFO    2022-02-16 21:09:44     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:45     MarkDuplicates  Read    15,000,000 records.  Elapsed time: 00:00:29s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:6,441
    INFO    2022-02-16 21:09:45     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:47     MarkDuplicates  Read    16,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:7,937
    INFO    2022-02-16 21:09:47     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:48     MarkDuplicates  Read    17,000,000 records.  Elapsed time: 00:00:32s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:8,612
    INFO    2022-02-16 21:09:48     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:49     MarkDuplicates  Read    18,000,000 records.  Elapsed time: 00:00:33s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:9,285
    INFO    2022-02-16 21:09:49     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:50     MarkDuplicates  Read    19,000,000 records.  Elapsed time: 00:00:35s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:10,082
    INFO    2022-02-16 21:09:50     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:52     MarkDuplicates  Read    20,000,000 records.  Elapsed time: 00:00:36s.  Time for last 1,000,000:    1s.  Last read position: KZ115098.1:12,646
    INFO    2022-02-16 21:09:52     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:53     MarkDuplicates  Read    21,000,000 records.  Elapsed time: 00:00:37s.  Time for last 1,000,000:    1s.  Last read position: KZ114841.1:92,490
    INFO    2022-02-16 21:09:53     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:55     MarkDuplicates  Read    22,000,000 records.  Elapsed time: 00:00:39s.  Time for last 1,000,000:    1s.  Last read position: 18S_1716nt:946
    INFO    2022-02-16 21:09:55     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:56     MarkDuplicates  Read    23,000,000 records.  Elapsed time: 00:00:40s.  Time for last 1,000,000:    1s.  Last read position: 18S_1946nt:893
    INFO    2022-02-16 21:09:56     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:57     MarkDuplicates  Read    24,000,000 records.  Elapsed time: 00:00:41s.  Time for last 1,000,000:    1s.  Last read position: 28S_4252nt:325
    INFO    2022-02-16 21:09:57     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:09:58     MarkDuplicates  Read    25,000,000 records.  Elapsed time: 00:00:42s.  Time for last 1,000,000:    1s.  Last read position: 28S_4252nt:1,149
    INFO    2022-02-16 21:09:58     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:00     MarkDuplicates  Read    26,000,000 records.  Elapsed time: 00:00:44s.  Time for last 1,000,000:    1s.  Last read position: 28S_4252nt:1,763
    INFO    2022-02-16 21:10:00     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:01     MarkDuplicates  Read    27,000,000 records.  Elapsed time: 00:00:45s.  Time for last 1,000,000:    1s.  Last read position: 28S_4252nt:2,430
    INFO    2022-02-16 21:10:01     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:02     MarkDuplicates  Read    28,000,000 records.  Elapsed time: 00:00:46s.  Time for last 1,000,000:    1s.  Last read position: 28S_4252nt:3,990
    INFO    2022-02-16 21:10:02     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:03     MarkDuplicates  Read    29,000,000 records.  Elapsed time: 00:00:48s.  Time for last 1,000,000:    1s.  Last read position: 28S_4278nt:812
    INFO    2022-02-16 21:10:03     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:05     MarkDuplicates  Read    30,000,000 records.  Elapsed time: 00:00:49s.  Time for last 1,000,000:    1s.  Last read position: 28S_4278nt:1,434
    INFO    2022-02-16 21:10:05     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:06     MarkDuplicates  Read    31,000,000 records.  Elapsed time: 00:00:50s.  Time for last 1,000,000:    1s.  Last read position: 28S_4278nt:2,135
    INFO    2022-02-16 21:10:06     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:07     MarkDuplicates  Read    32,000,000 records.  Elapsed time: 00:00:51s.  Time for last 1,000,000:    1s.  Last read position: 28S_4278nt:3,141
    INFO    2022-02-16 21:10:07     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2022-02-16 21:10:08     MarkDuplicates  Read 32675474 records. 0 pairs never matched.
    INFO    2022-02-16 21:10:09     MarkDuplicates  After buildSortedReadEndLists freeMemory: 765922192; totalMemory: 837369856; maxMemory: 2075918336
    INFO    2022-02-16 21:10:09     MarkDuplicates  Will retain up to 64872448 duplicate indices before spilling to disk.
    INFO    2022-02-16 21:10:09     MarkDuplicates  Traversing read pair information and detecting duplicates.
    INFO    2022-02-16 21:10:09     MarkDuplicates  Traversing fragment information and detecting duplicates.
    INFO    2022-02-16 21:10:11     MarkDuplicates  Sorting list of duplicate records.
    INFO    2022-02-16 21:10:12     MarkDuplicates  After generateDuplicateIndexes freeMemory: 929069616; totalMemory: 1462607872; maxMemory: 2075918336
    INFO    2022-02-16 21:10:12     MarkDuplicates  Marking 7276294 records as duplicates.
    INFO    2022-02-16 21:10:12     MarkDuplicates  Found 0 optical duplicate clusters.
    INFO    2022-02-16 21:10:12     MarkDuplicates  Reads are assumed to be ordered by: coordinate
    INFO    2022-02-16 21:11:21     MarkDuplicates  Written    10,000,000 records.  Elapsed time: 00:01:09s.  Time for last 10,000,000:   69s.  Last read position: chr12:17,156,783
    INFO    2022-02-16 21:12:29     MarkDuplicates  Written    20,000,000 records.  Elapsed time: 00:02:17s.  Time for last 10,000,000:   67s.  Last read position: KZ115098.1:12,646
    INFO    2022-02-16 21:13:33     MarkDuplicates  Written    30,000,000 records.  Elapsed time: 00:03:20s.  Time for last 10,000,000:   63s.  Last read position: 28S_4278nt:1,434
    INFO    2022-02-16 21:13:50     MarkDuplicates  Writing complete. Closing input iterator.
    INFO    2022-02-16 21:13:50     MarkDuplicates  Duplicate Index cleanup.
    INFO    2022-02-16 21:13:50     MarkDuplicates  Getting Memory Stats.
    INFO    2022-02-16 21:13:50     MarkDuplicates  Before output close freeMemory: 1446677912; totalMemory: 1462607872; maxMemory: 2075918336
    INFO    2022-02-16 21:13:52     MarkDuplicates  Closed outputs. Getting more Memory Stats.
    INFO    2022-02-16 21:13:52     MarkDuplicates  After output close freeMemory: 1352831712; totalMemory: 1368248320; maxMemory: 2075918336
    [Wed Feb 16 21:13:52 IST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 4.62 minutes.
    Runtime.totalMemory()=1368248320

     

     

     

     

    Hope I understood correctly what to share.
    Thank you!
    Dina

     

  • Dina Tzur

    And this is the metrics file I get:

    ## htsjdk.samtools.metrics.StringHeader
    # MarkDuplicates INPUT=[/group/BAM_z/BAM_Lee2013/bam_bai/2.0hpf_wt_total.star.bam.sort.bam] OUTPUT=/group/BAM_z/BAM_Lee2013/md/2.0hpf_wt_total.star.bam.MD.bam METRICS_FILE=/group/BAM_z/BAM_Lee2013/md/2.0hpf_wt_total.star.bam.MD_matrix.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    ## htsjdk.samtools.metrics.StringHeader
    # Started on: Wed Feb 16 21:09:15 IST 2022

    ## METRICS CLASS        picard.sam.DuplicationMetrics
    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES     PERCENT_DUPLICATION     ESTIMATED_LIBRARY_SIZE
    Unknown Library 10503556        0       22171918        0       7276294 0       0       0.692746

     

  • Genevieve Brandt (she/her)

    Hi Dina,

    Yes, this is what I was looking for, thank you!

    This doesn't look like a problem with MarkDuplicates. Depending on your reads, everything may be fine. You got only one line in the metrics file because MarkDuplicates detected only one library in this BAM. If you expected more than one library, go back and make sure your pre-processing steps were done correctly.

    The warning could indicate an issue with the read names in your BAM file. We have an article about SAM/BAM files; take a look and make sure your files meet the specifications: SAM or BAM or CRAM - Mapped sequence data formats. There's also a related forum post that could be helpful.

    Your metrics output also indicates that all of your reads were evaluated as unpaired. Is this expected? If not, check your file for issues.
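
    As a rough illustration of how to read those metrics columns, the sketch below recomputes PERCENT_DUPLICATION from the counts in the metrics file posted above. It assumes the definition given in Picard's DuplicationMetrics documentation (each read pair counts as two reads); the function name is made up for this example and is not part of Picard.

```python
def percent_duplication(unpaired_examined, pairs_examined,
                        unpaired_duplicates, pair_duplicates):
    """Fraction of examined reads flagged as duplicates.

    Assumed definition: every read pair contributes two reads to both
    the numerator and the denominator.
    """
    examined = unpaired_examined + 2 * pairs_examined
    duplicates = unpaired_duplicates + 2 * pair_duplicates
    return duplicates / examined if examined else 0.0


# Counts copied from the metrics row posted above:
# UNPAIRED_READS_EXAMINED=10503556, READ_PAIRS_EXAMINED=0,
# UNPAIRED_READ_DUPLICATES=7276294, READ_PAIR_DUPLICATES=0
value = percent_duplication(10503556, 0, 7276294, 0)
print(round(value, 6))
```

    With READ_PAIRS_EXAMINED at 0, the whole figure comes from the unpaired columns, which is why it is worth confirming that single-end data is what you expect.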

    Let me know if you have any further questions.

    Best,

    Genevieve

  • Dina Tzur

    Hi Genevieve,

    I have already run MarkDuplicates once before and got a metrics file in this style:

    ## htsjdk.samtools.metrics.StringHeader
    # MarkDuplicates INPUT=[/sci/home/dinnatzur12/group/BAM_z/BAM_Pauli2011/bam_bai/2.5hpf.star.bam.sort.bam] OUTPUT=/sci/home/dinnatzur12/group/BAM_z/BAM_Pauli2011/md/2.5hpf.star.bam.MD.bam METRICS_FILE=/sci/home/dinnatzur12/group/BAM_z/BAM_Pauli2011/md/2.5hpf.star.bam.MD_matrix.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    ## htsjdk.samtools.metrics.StringHeader
    # Started on: Tue Feb 08 12:29:57 IST 2022

    ## METRICS CLASS        picard.sam.DuplicationMetrics
    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES     PERCENT_DUPLICATION     ESTIMATED_LIBRARY_SIZE
    Unknown Library 0       164830805       46942564        0       0       36215207        47810   0.219711        318172211

    ## HISTOGRAM    java.lang.Double
    BIN     CoverageMult    all_sets        optical_sets    non_optical_sets
    1.0     1.000221        111529041       0       111561454
    2.0     1.596031        11186571        47676   11160566
    3.0     1.950942        2909758 59      2905831
    4.0     2.162354        1167906 4       1166793
    5.0     2.288288        586193  1       585659
    6.0     2.363304        337103  0       336840
    7.0     2.407989        212422  0       212295
    8.0     2.434607        144258  0       144126
    9.0     2.450463        101000  0       100931
    10.0    2.459908        73941   0       73916
    11.0    2.465534        56383   0       56347
    12.0    2.468885        43983   0       43956
    13.0    2.470882        34999   0       34975
    14.0    2.472071        28360   0       28344
    15.0    2.472779        23364   0       23339
    16.0    2.473201        19084   0       19077
    17.0    2.473453        16245   0       16245
    18.0    2.473602        13759   0       13756
    19.0    2.473691        11889   0       11875
    20.0    2.473745        10237   0       10219
    21.0    2.473776        9066    0       9077
    22.0    2.473795        7918    0       7899
    23.0    2.473806        7042    0       7045
    24.0    2.473813        6031    0       6032
    25.0    2.473817        5492    0       5491
    26.0    2.473819        5031    0       5024
    27.0    2.473821        4349    0       4344
    28.0    2.473822        4082    0       4083
    29.0    2.473822        3658    0       3651
    30.0    2.473822        3315    0       3328
    31.0    2.473823        3048    0       3039
    32.0    2.473823        2752    0       2760
    33.0    2.473823        2450    0       2441
    34.0    2.473823        2360    0       2353
    35.0    2.473823        2206    0       2217
    36.0    2.473823        2004    0       1993
    37.0    2.473823        1903    0       1901
    38.0    2.473823        1734    0       1733
    39.0    2.473823        1566    0       1568

     

    So I expected to get a metrics file of this type, not one with just a single row.

    I thought the problem was the read names and looked at the two links you provided (thank you very much!).
    Following the recommendation in the second link, I edited the read names to look like this:

    3:2:2:718:17    4       *       0       0       *       *       0       0       NGCTTTTAGGCGGGATTCTGACTTAGAGGCGTTCAGTCATAATCCCGCAG      #AAAFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ    YT:Z:UU
    3:2:2:718:18    4       *       0       0       *       *       0       0       NCGGGGCCTATCGGAGATCCGACGGCGCTGCTGTATCGTTGCTTTTAGGC      #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ    YT:Z:UU
    3:2:2:718:19    4       *       0       0       *       *       0       0       NCCGAGGTCTTTTTTTTTTTTTTTAACTTTGCATTTACAGGAACGCTGCC      #AAFFJJJJJJJJJJJJJJJJJJJJJ-FJJ---A--7-<A-FAA7AA7A<    YT:Z:UU
    3:2:2:718:20    4       *       0       0       *       *       0       0       NCCGAGGTCTTTTTTTTTTTTTTTAACTTTGCATCTACAGGAACGCTGCC      #AAFFJJJJJJJJJJJJJJJJJJJF<<JJJ-<FJ<JF<-<FFFJ7AJ-7A    YT:Z:UU
    3:2:2:718:21    4       *       0       0       *       *       0       0       NGCAGTACGAATGCCCCCGTCTGTCTCTGTTAACCATTACCTCAAGTCCA      #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ    YT:Z:UU
    3:2:2:718:22    16      chr24   24171147        100     50M     *       0       0       GCTGCCGGAGGACCCGAGGAGACGCAGCCTGTGGATGAAGTTTATCGAGN      JJF<JFJJJJAAJFJJJJJAJAFJAJJJ7JJJJJJFFJJJJJF-FAAAA#    AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:49G0       YT:Z:UU ZW:f:1
    3:2:2:718:23    256     chr4    77551337        3       50M     *       0       0       NTCTGATAAATGCACGCGTCCCCGGGTACCCACCCCCCGCCCCGAGGGGA      #AAFFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJ<JF7FJFF    AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0A49       YT:Z:UU ZW:f:0.5
    3:2:2:718:23    0       chr4    77562888        3       50M     *       0       0       NTCTGATAAATGCACGCGTCCCCGGGTACCCACCCCCCGCCCCGAGGGGA      #AAFFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJ<JF7FJFF    AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0A49       YT:Z:UU ZW:f:0.5
    3:2:2:718:24    16      chr1    24558245        100     50M     *       0       0       AAGTTTTATAGTTGTTTTCTTTTATTTTCCTAATTATTTTACCAAAGCTN      JFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFFAA#    AS:i:-2 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:30A18G0    YT:Z:UU ZW:f:1
    3:2:2:718:25    16      chr20   29580097        100     50M     *       0       0       ACTGGCTCTCAACTTCTCTGTCTTCTACTATGAGATCCTTAACTCTCCGN      JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJFFAA#    AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:49G0       YT:Z:UU ZW:f:1

     

    This made the warning go away, but I still got only one line in the metrics file.

    So I would love it if you could tell me what exactly MarkDuplicates expects to find in the read name. Even though I tried, I could not figure it out from the first link you gave (the expected format is not described in that much detail).

    Thank you for your patience and all the help!
    Dina

     

     

  • Genevieve Brandt (she/her)

    Hi Dina Tzur,

    MarkDuplicates does not produce the histogram if there is more than one read group in the file. Here's a biostars post with a good explanation: https://www.biostars.org/p/115044/.

    Just in case you are not familiar, read groups are different than read names. We have an explanation in this article here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
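
    To see which read groups and libraries a BAM declares, you can look at the @RG lines in its header (for example, the text printed by samtools view -H). The sketch below parses such header text; the function name and the example header are hypothetical and are not taken from the files in this thread.

```python
def read_groups_from_header(header_text):
    """Return one dict of tag -> value for each @RG line in a SAM header."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        tags = {}
        # Header fields are tab-separated TAG:VALUE entries after the record type.
        for field in line.split("\t")[1:]:
            key, _, value = field.partition(":")
            tags[key] = value
        groups.append(tags)
    return groups


# Hypothetical header, for illustration only.
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:lane1\tSM:sampleA\tLB:libA",
    "@RG\tID:lane2\tSM:sampleA\tLB:libA",
])

groups = read_groups_from_header(example_header)
# Reads without an LB tag are reported under "Unknown Library" in the metrics.
libraries = {g.get("LB", "Unknown Library") for g in groups}
print(len(groups), sorted(libraries))
```

    A header with no @RG lines yields an empty list, which is consistent with the "Unknown Library" row in the metrics files above.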

    So the difference you are seeing is not related to the read name warning. And it looks like the read names are matching the specifications now!
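
    For reference, here is a minimal sketch of the kind of check behind that warning. It assumes the behaviour described in the Picard documentation for the default READ_NAME_REGEX: the read name is split on ':' and, for names with 5 or 7 fields, the last three fields are read as tile, x and y. This is a simplified approximation written for illustration, not Picard's actual parser, and the function name is made up.

```python
def parse_tile_x_y(read_name):
    """Return (tile, x, y) if the name fits the assumed default layout, else None."""
    fields = read_name.split(":")
    # Assumption: only 5-field and 7-field (CASAVA 1.8 style) names are accepted.
    if len(fields) not in (5, 7):
        return None
    last_three = fields[-3:]
    # Simplification: require plain ASCII digits in all three fields.
    if not all(f.isascii() and f.isdigit() for f in last_three):
        return None
    tile, x, y = (int(f) for f in last_three)
    return tile, x, y


# The original read name from the warning, and one of the renamed reads.
print(parse_tile_x_y("2hpf_wt_total_SRR870747.420964"))
print(parse_tile_x_y("3:2:2:718:17"))
```

    Under these assumptions the original name has no ':'-separated fields at all and is rejected, while the renamed reads parse cleanly. Keep in mind that tile and coordinate values are only meaningful for optical duplicate detection if they come from the sequencer.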

    Let me know if you have any other questions.

    Best,

    Genevieve

  • Dina Tzur

    Hi Genevieve,

    Thank you very much!
    So it seems that MarkDuplicates is working correctly for me?

    Dina

  • Genevieve Brandt (she/her)

    Yes, it's working as expected.

  • Dina Tzur

    Thank you!

    Dina

