Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

MarkDuplicates (Picard)

4 comments

  • JAY SINGH

    What should be the "--ASSUME_SORTED" option if the bam file is sorted by query name?

  • Damian Loska

    Are you sure you've set the default COMPRESSION_LEVEL to 5?

  • Phoebe Magdy

    Cannot find output files after running MarkDuplicates with Picard tools

    I have some sorted BAM files and wanted to mark the duplicate reads using MarkDuplicates in Picard.
    All files are in a directory named `AlignmentOfTrimmed_Sam_Files`; its full path is defined below, and it is my current working directory.
    I have run this command several times, with minor changes each time, and each run takes about an hour, but I have never been able to find the output files.

    Any suggestions?
    Thanks in advance.
       
    ```
    import os

    # Directory containing the sorted BAM files
    samfiles_dir = '/media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/'

    # Loop over the sorted BAM files and mark duplicates with Picard.
    # Jupyter/IPython syntax: `!` runs a shell command; picard_path must be defined earlier in the notebook.
    # --OUTPUT and --METRICS_FILE are relative paths, so both files go to the process's working directory.
    for file in os.listdir(samfiles_dir):
        if file.endswith('sorted.bam'):
            inputfile = os.path.join(samfiles_dir, file)
            # e.g. 'S000021_S5424Nr_7_sorted.bam' -> 'S000021_S5424Nr_7'
            fileBasename = '_'.join(os.path.basename(file).rsplit('_', 4)[0:3])
            !java -Xmx20g -jar {picard_path}/picard.jar MarkDuplicates --INPUT {inputfile} \
            --OUTPUT {fileBasename}.markdup.bam \
            --METRICS_FILE {fileBasename}.metrics.txt
    ```

    Here is part of the output:
    ```
    MarkDuplicates starts at 2022-09-18 16:07:52.296874
    16:07:53.413 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/phmagdy/miniconda3/envs/Jhm/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Sun Sep 18 16:07:53 EET 2022] MarkDuplicates --INPUT /media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/S000021_S5424Nr_7_sorted.bam --OUTPUT S000021_S5424Nr_7.markdup.bam --METRICS_FILE S000021_S5424Nr_7.metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    [Sun Sep 18 16:07:53 EET 2022] Executing as phmagdy@ubuntu on Linux 5.15.0-46-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
    INFO    2022-09-18 16:07:53    MarkDuplicates    Start of doWork freeMemory: 208248760; totalMemory: 221249536; maxMemory: 19088801792
    INFO    2022-09-18 16:07:53    MarkDuplicates    Reading input file and constructing read end information.
    INFO    2022-09-18 16:07:53    MarkDuplicates    Will retain up to 69162325 data points before spilling to disk.
    INFO    2022-09-18 16:08:00    MarkDuplicates    Read     1,000,000 records.  Elapsed time: 00:00:06s.  Time for last 1,000,000:    6s.  Last read position: chr1:16,264,133
    INFO    2022-09-18 16:08:00    MarkDuplicates    Tracking 3899 as yet unmatched pairs. 422 records in RAM.
    INFO    2022-09-18 16:08:05    MarkDuplicates    Read     2,000,000 records.  Elapsed time: 00:00:11s.  Time for last 

    ```

    N.B. There was no error at the end of the execution after almost one hour. Here are the last few lines:

    INFO    2022-09-18 14:58:24    MarkDuplicates    Read    41,000,000 records.  Elapsed time: 00:03:19s.  Time for last 1,000,000:    3s.  Last read position: chr8:107,782,217
    INFO    2022-09-18 14:58:24    MarkDuplicates    Tracking 114840 as yet unmatched pairs. 2544 records in RAM.
    INFO    2022-09-18 14:59:01    MarkDuplicates    Read    42,000,000 records.  Elapsed time: 00:03:57s.  Time for last 1,000,000:   37s.  Last read position: chr9:2,718,932
    INFO    2022-09-18 14:59:01    MarkDuplicates    Tracking 114824 as yet unmatched pairs. 9314 records in RAM.
    INFO    2022-09-18 14:59:57    MarkDuplicates    Read    43,000,000 records.  Elapsed time: 00:04:52s.  Time for last 1,000,000:   55s.  Last read position: chr9:66,499,605
    INFO    2022-09-18 14:59:57    MarkDuplicates    Tracking 114507 as yet unmatched pairs. 6658 records in RAM.
    INFO    2022-09-18 15:00:02    MarkDuplicates    Read    44,000,000 records.  Elapsed time: 00:04:57s.  Time for last 1,000,000:    4s.  Last read position: chr9:107,578,518
    INFO    2022-09-18 15:00:02    MarkDuplicates    Tracking 113906 as yet unmatched pairs. 3393 records in RAM.
    Time elapsed = 0:57:49.228557 

     

  • Xianying Cheng

    Hi team,

    Do you have any suggestions for setting --OPTICAL_DUPLICATE_PIXEL_DISTANCE when analyzing data from NovaSeq X and NovaSeq 6000?

    Thank you

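On the missing output files reported in Phoebe Magdy's comment: the logged command line shows `--OUTPUT S000021_S5424Nr_7.markdup.bam`, a relative path, so the file is written to whatever working directory the Java process had, which may not be the directory holding the input BAMs. Also, an IPython `!` command does not raise a Python exception when the command fails, so a run that stops early can pass unnoticed. The following is a minimal sketch, not a confirmed fix: it builds absolute output paths and checks both the exit code and the existence of the output. The helper names (`sample_name`, `build_markduplicates_command`, `run_and_verify`) and the `picard_jar` and `output_dir` arguments are hypothetical; only the `--INPUT`, `--OUTPUT`, and `--METRICS_FILE` arguments are taken from the post.

```python
import subprocess
from pathlib import Path


def sample_name(bam_path):
    """Derive the sample prefix the same way as the loop in the post,
    e.g. 'S000021_S5424Nr_7_sorted.bam' -> 'S000021_S5424Nr_7'."""
    return "_".join(Path(bam_path).name.rsplit("_", 4)[0:3])


def build_markduplicates_command(picard_jar, input_bam, output_dir):
    """Return (command, output_bam, metrics_file) using absolute output paths.

    picard_jar and output_dir are assumptions supplied by the caller.
    """
    output_dir = Path(output_dir).resolve()
    prefix = sample_name(input_bam)
    output_bam = output_dir / f"{prefix}.markdup.bam"
    metrics_file = output_dir / f"{prefix}.metrics.txt"
    command = [
        "java", "-Xmx20g", "-jar", str(picard_jar), "MarkDuplicates",
        "--INPUT", str(input_bam),
        "--OUTPUT", str(output_bam),
        "--METRICS_FILE", str(metrics_file),
    ]
    return command, output_bam, metrics_file


def run_and_verify(command, expected_output):
    """Run the command; fail loudly on a non-zero exit code or a missing output file."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Show only the tail of stderr, where the final error usually appears.
        raise RuntimeError(f"exit code {result.returncode}:\n{result.stderr[-2000:]}")
    if not Path(expected_output).is_file():
        raise FileNotFoundError(f"command succeeded but {expected_output} is missing")
    return result
```

In a notebook, this could be called once per `sorted.bam` file as `cmd, out_bam, _ = build_markduplicates_command(jar, bam, out_dir)` followed by `run_and_verify(cmd, out_bam)`. Printing `os.getcwd()` in the original notebook would also show where the relative outputs of earlier runs were directed.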
