Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Big variance of PCT_BASES_20X between software versions

16 comments

  • Genevieve Brandt (she/her)

    Hi JonR, thank you for the question. Here is the link to the documentation for MarkDuplicates.

    The default for --REMOVE_DUPLICATES is false because GATK prioritizes keeping more information in case it is needed for your analysis. When you set --REMOVE_DUPLICATES to true, duplicate reads are removed from the output, so it makes sense that PCT_TARGET_BASES_20X changes.

    We cannot troubleshoot or test GATK3 because it is no longer supported, so we cannot explain why PCT_TARGET_BASES_20X did not change when you removed duplicates.

    As for the preferred metric, it depends on your use case. You can leave --REMOVE_DUPLICATES at its default of false, but the right choice depends on how you will use your data and whether you will primarily be using GATK or other platforms.
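
    For context, duplicate status is recorded in the SAM FLAG field (bit 0x400, decimal 1024, per the SAM specification). With --REMOVE_DUPLICATES false the duplicate reads stay in the file with that bit set; with true they are dropped from the output. A minimal sketch for counting flagged records in SAM text follows. The function name count_dups and the three sample records are made up for illustration; on a real BAM, the same text could be piped in from `samtools view`.

    ```shell
    # Count SAM records whose FLAG (column 2) has the duplicate bit (0x400 = 1024) set.
    # Reads SAM text on stdin; header lines starting with "@" are skipped.
    count_dups() {
        awk -F '\t' '
            /^@/ { next }                          # skip header lines
            int($2 / 1024) % 2 == 1 { dups++ }     # duplicate bit is set
            { total++ }
            END { printf "%d of %d records flagged as duplicates\n", dups, total }
        '
    }

    # Hypothetical records: FLAG 99 (not a duplicate), 1123 = 99 + 1024 (duplicate), 147.
    printf 'r1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*\nr2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*\nr3\t147\tchr1\t200\t60\t50M\t=\t100\t-150\t*\t*\n' | count_dups
    # prints: 1 of 3 records flagged as duplicates
    ```

    A BAM written with --REMOVE_DUPLICATES true would be expected to report zero flagged records, since the duplicates are no longer in the file.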

  • JonR

    Hello Genevieve,

    Then compare gatk4-4.0.0.0-0 with the newest version. At 50X I'm getting 72% vs 6% coverage on one of my datasets. That's a difference of 66 percentage points.

  • Genevieve Brandt (she/her)

    Hi JonR, could you please give me the specific commands you used to get the 72% and 6% coverage?

  • JonR

     

    version 4.0.0.0

    gatk-launch CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/HG02026_HPR_Rep2_Cap04_70Me_30M.bam --OUTPUT GATK4/GATK400-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000

     

    gatk4-4.1.7.0-0

    gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/HG02026_HPR_Rep2_Cap04_70Me_30M.bam --OUTPUT GATK4/GATK4-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000

  • JonR

    For PCT_TARGET_BASES_50X:

    version 4.0.0.0: 0.724252
    gatk4-4.1.7.0-0: 0.068928

     

    This is with Picard's REMOVE_DUPLICATES set to true. There doesn't seem to be an issue when it is set to false.
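
    The two PCT_TARGET_BASES_50X values above are fractions of target bases, so the gap can be stated either in percentage points or as a ratio. A small arithmetic sketch using only the two numbers posted in this thread (the function name compare_pct is made up for illustration):

    ```shell
    # Compare two coverage fractions: absolute gap in percentage points and the ratio.
    compare_pct() {
        awk -v a="$1" -v b="$2" 'BEGIN {
            printf "%.1f%% vs %.1f%%: %.1f percentage points apart, ratio %.1fx\n",
                   a * 100, b * 100, (a - b) * 100, a / b
        }'
    }

    compare_pct 0.724252 0.068928
    # prints: 72.4% vs 6.9%: 65.5 percentage points apart, ratio 10.5x
    ```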

  • Genevieve Brandt (she/her)

    Hi JonR, MarkDuplicates tags the duplicates, and CollectHsMetrics ignores them. So you should get the same coverage from GATK 4.1.8.0 with --REMOVE_DUPLICATES true and false. Please test this and confirm.

    Regarding GATK 4.0.0.0-0, that is a very old version of GATK4, and we do not recommend using it for results or for comparison, since it may not be flagging variants correctly.

    Let us know if you get the same coverage with --REMOVE_DUPLICATES true and false on the same version of GATK. At a 50X threshold, a low PCT_TARGET_BASES_50X would be reasonable.

  • JonR

    I just updated GATK4 to the newest version, 4.1.8, and ran it. These are the results I got with CollectHsMetrics.

    Keep in mind the BAM files were still created with version 4.0.0.0-0. I can redo the entire pipeline with the newest GATK if you want.

     

    #GATK-4 FALSE

    • FOLD_80_BASE_PENALTY 4.981967
    • PCT_TARGET_BASES_20X 0.611954

    #GATK-4 TRUE

    • FOLD_80_BASE_PENALTY 0.85497
    • PCT_TARGET_BASES_20X 0.407856

     

  • Genevieve Brandt (she/her)

    Could you send me the complete commands you are using for MarkDuplicates and CollectHsMetrics?

  • JonR

    CollectHsMetrics command FALSE

     

    # gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
    # =================================================================
    16:08:12.575 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Mon Jul 20 16:08:12 EDT 2020] CollectHsMetrics --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --COVERAGE_CAP 100000 --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    Jul 20, 2020 4:08:12 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    [Mon Jul 20 16:08:12 EDT 2020] Executing as XXXX.com on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.0
    WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam
    INFO 2020-07-20 16:08:36 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 12s. Last read position: chr1:109,250,541
    INFO 2020-07-20 16:08:46 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:21s. Time for last 1,000,000: 9s. Last read position: chr1:227,845,674

     

    CollectHsMetrics command True

    gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metrics.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000 --PER_BASE_COVERAGE GATK4/XXXXHyper_1_C_50.7X_per_base_coverage.txt
    # =================================================================
    16:08:49.344 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Mon Jul 20 16:08:49 EDT 2020] CollectHsMetrics --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metrics.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --PER_BASE_COVERAGE GATK4/XXXXHyper_1_C_50.7X_per_base_coverage.txt --COVERAGE_CAP 100000 --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    Jul 20, 2020 4:08:49 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    [Mon Jul 20 16:08:49 EDT 2020] Executing as XXXX.com on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.0
    WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam
    INFO 2020-07-20 16:09:06 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 12s. Last read position: chr1:109,491,321
    INFO 2020-07-20 16:09:17 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:22s. Time for last 1,000,000: 10s. Last read position: chr1:228,372,652

    Marking Duplicates with 4.0.0.0 FALSE

    gatk-launch MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/XXXXHyper_1_C_50.7X.bam -O R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --METRICS_FILE R3_STATS/XXXXHyper_1_C_50.7X_markDuplicatesFalse.txt --REMOVE_DUPLICATES false --ASSUME_SORTED true --CREATE_INDEX true > LOGS/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X__markDuplicatesFalse.log 2>&1
    valdationAgainstSogloffRUn-1C.log

    Marking Duplicates with 4.0.0.0 TRUE

    gatk-launch MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/XXXXHyper_1_C_50.7X.bam -O R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --METRICS_FILE R3_STATS/XXXXHyper_1_C_50.7X_markDuplicatesTrue.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true > LOGS/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X__markDuplicatesTrue.log 2>&1

     

     

     

  • Genevieve Brandt (she/her)

    JonR, are you running MarkDuplicates with 4.0.0.0? Please run it with the newest version of GATK. If you can run the whole pipeline with the newest version of GATK, that would be the most helpful.

  • JonR

    Every part of this pipeline was done with the newest version of GATK4. I installed it today.

    I started from a SAM file, and every operation was run today.

     

    #GATK-4 FALSE

    • FOLD_80_BASE_PENALTY 4.981967
    • PCT_TARGET_BASES_20X 0.611954


    #GATK-4 TRUE

    • FOLD_80_BASE_PENALTY 0.85497
    • PCT_TARGET_BASES_20X 0.407856

     

  • JonR

    My commands

    • mkdir -p R3e_MAP_READS
    • mkdir -p R3f_SAM_TO_BAM
    • mkdir -p R3g_FIXMATE
    • mkdir -p R3n_BED_TO_INTERVAL_LIST
    • mkdir -p R3h_SORT
    • mkdir -p R3q_REMOVE_DUPLICATES_FALSE
    • mkdir -p R3j_REMOVE_DUPLICATES
    • mkdir -p R3_STATS
    • gatk BedToIntervalList --TMP_DIR TEMP --INPUT BED/TEST.bed --SEQUENCE_DICTIONARY $DICT --OUTPUT R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_target.interval_list
    • gatk BedToIntervalList --TMP_DIR TEMP --INPUT BED/TEST.bed --SEQUENCE_DICTIONARY $DICT --OUTPUT R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_bait.interval_list
    • gatk SamFormatConverter -I R3e_MAP_READS/$SAMPLE\.sam -O R3f_SAM_TO_BAM/$SAMPLE\.bam
    • rm R3e_MAP_READS/$SAMPLE\.sam
    • gatk FixMateInformation --TMP_DIR TEMP -I R3f_SAM_TO_BAM/$SAMPLE\.bam -O R3g_FIXMATE/$SAMPLE\.bam
    • gatk SortSam --TMP_DIR TEMP -I R3g_FIXMATE/$SAMPLE\.bam -O R3h_SORT/$SAMPLE\.bam -SO coordinate
    • gatk MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/$SAMPLE\.bam -O R3q_REMOVE_DUPLICATES_FALSE/$SAMPLE\.bam --METRICS_FILE R3_STATS/$SAMPLE\_markDuplicatesFalse.txt --REMOVE_DUPLICATES false --ASSUME_SORTED true --CREATE_INDEX true
    • gatk MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/$SAMPLE\.bam -O R3j_REMOVE_DUPLICATES/$SAMPLE\.bam --METRICS_FILE R3_STATS/$SAMPLE\_markDuplicatesTrue.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true
    • gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/$SAMPLE\.bam --OUTPUT GATK4/$SAMPLE\_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE $REF -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
    • gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/$SAMPLE\.bam --OUTPUT GATK4/$SAMPLE\_hs_metrics.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE $REF -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000 --PER_BASE_COVERAGE GATK4/$SAMPLE\_per_base_coverage.txt
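
    The two MarkDuplicates/CollectHsMetrics branches in the list above differ only in the REMOVE_DUPLICATES setting and the output paths, so they can be generated from one template. Below is a dry-run sketch that only prints the commands. Every flag is copied from the list above; the function name print_cmds, the DEDUP_<setting> directories, and the output file suffixes are made up for illustration and are not the paths used in this thread.

    ```shell
    # Print the MarkDuplicates and CollectHsMetrics commands for one sample and one
    # REMOVE_DUPLICATES setting. Nothing is executed; pipe the output to `sh` to run it.
    # usage: print_cmds SAMPLE true|false REFERENCE_FASTA
    print_cmds() {
        sample=$1
        remove=$2
        ref=$3
        dir="DEDUP_${remove}"    # illustrative output directory, one per setting
        echo "gatk MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT" \
             "-I R3h_SORT/${sample}.bam -O ${dir}/${sample}.bam" \
             "--METRICS_FILE R3_STATS/${sample}_markDuplicates_${remove}.txt" \
             "--REMOVE_DUPLICATES ${remove} --ASSUME_SORTED true --CREATE_INDEX true"
        echo "gatk CollectHsMetrics --TMP_DIR TEMP" \
             "--BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/${sample}_bait.interval_list" \
             "--BAIT_SET_NAME DESIGN" \
             "--TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/${sample}_target.interval_list" \
             "--INPUT ${dir}/${sample}.bam --OUTPUT GATK4/${sample}_hs_metrics_${remove}.txt" \
             "--METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE ${ref}" \
             "--VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000"
    }

    # Emit both branches; SAMPLE and REF fall back to placeholder names if unset.
    for setting in false true; do
        print_cmds "${SAMPLE:-SAMPLE_NAME}" "$setting" "${REF:-reference.fa}"
    done
    ```

    Generating both branches from one template makes it harder for the two runs to drift apart in anything other than the duplicate setting.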

     

  • Genevieve Brandt (she/her)

    Hi JonR, thanks for the clarification. Would you be able to follow these instructions and upload the files you are using so we can test it on our end?

  • JonR

    Ok file uploaded.  

     

    GATKBUG-Brandt-JonR.tar.gz

     

    Check out the readme contained within.  Simply activate the GATK conda environment, run the script and you should have all the files created from the original bam.

  • Genevieve Brandt (she/her)

    Thank you for uploading those files, JonR, we will look into it and get back to you.

  • Genevieve Brandt (she/her)

    Hi JonR, thank you for your patience as we look into this. I spoke with a Picard developer and found that I previously misspoke. FOLD_80_BASE_PENALTY and PCT_TARGET_BASES_20X do not ignore duplicates; the coverage metrics that ignore duplicates have "UNIQUE" or "UQ" in their names. That explains why you are getting different coverage values with --REMOVE_DUPLICATES true and false.
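
    Because the answer depends on which column is read, it can help to pull named columns out of the CollectHsMetrics output directly. A minimal sketch follows, assuming the usual Picard metrics layout (a `## METRICS CLASS` line followed by a tab-separated header row and a value row). The function name hs_metric and the trimmed three-column sample file are illustrative; the two values in it are the ones posted earlier in this thread for the REMOVE_DUPLICATES false run.

    ```shell
    # Print "NAME<TAB>value" for each requested column of a Picard-style metrics file.
    # usage: hs_metric FILE COLUMN [COLUMN...]
    hs_metric() {
        file=$1
        shift
        awk -F '\t' -v want="$*" '
            BEGIN { n = split(want, names, " ") }
            /^## METRICS CLASS/ { state = 1; next }      # the header row comes next
            state == 1 { for (i = 1; i <= NF; i++) col[$i] = i; state = 2; next }
            state == 2 {                                 # first value row
                for (j = 1; j <= n; j++)
                    if (names[j] in col) print names[j] "\t" $(col[names[j]])
                exit
            }
        ' "$file"
    }

    # Trimmed, illustrative metrics file; a real file has many more columns.
    example=$(mktemp)
    printf '## METRICS CLASS\tpicard.analysis.directed.HsMetrics\nBAIT_SET\tFOLD_80_BASE_PENALTY\tPCT_TARGET_BASES_20X\nDESIGN\t4.981967\t0.611954\n' > "$example"
    hs_metric "$example" PCT_TARGET_BASES_20X FOLD_80_BASE_PENALTY
    ```

    Any column whose name contains UNIQUE or UQ can be requested the same way, which allows a side-by-side look at duplicate-aware and duplicate-inclusive values from the same run.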

    Would you also like us to look into these methods changing in the newer GATK versions? If so, please upload all the input files for MarkDuplicates and CollectHsMetrics, so that we do not have to run any of the other steps. Also please include the sample coverage output file for all of these tests so that I can see the final output on your end.
