Big variance in PCT_TARGET_BASES_20X between software versions
I'm seeing a big variance in PCT_TARGET_BASES_20X between software versions and duplicate-marking settings.
The BED and interval files are identical; only the software version and the MarkDuplicates REMOVE_DUPLICATES setting (true/false) change.
Version  REMOVE_DUPLICATES  PCT_TARGET_BASES_20X
GATK4    true               0.407856
GATK3    true               0.638736
GATK4    false              0.611954
GATK3    false              0.638736
What is the preferred metric? Is it GATK4 with REMOVE_DUPLICATES=true? That result is the outlier.
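(For reference, a quick sketch that regenerates the table above from the four HsMetrics files, using the output names that appear in the commands further down; it assumes the standard Picard metrics layout of '#'-commented lines, then a header row, then one data row:)
for f in GATK4/GATK{4,3}-removeDup{True,False}.txt; do
  printf '%s\t' "$f"
  # find the PCT_TARGET_BASES_20X column in the header row, print it from the data row
  grep -v '^#' "$f" | grep -v '^$' | head -2 | \
    awk -F'\t' 'NR==1{for(i=1;i<=NF;i++) if($i=="PCT_TARGET_BASES_20X")c=i} NR==2{print $c}'
done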
Can you please provide
a) GATK version used
picard-1.134
gatk4-4.1.7
b) Exact GATK commands used
For marking duplicates:
gatk-launch MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/XXXXHyper_1_C_50.7X.bam -O R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --METRICS_FILE R3_STATS/XXXXHyper_1_C_50.7X_markDuplicatesFalse.txt --REMOVE_DUPLICATES false --ASSUME_SORTED true --CREATE_INDEX true
gatk-launch CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT R3_STATS/XXXXHyper_1_C_50.7X_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /mnt/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000 --PER_BASE_COVERAGE R3_STATS/XXXXHyper_1_C_50.7X_per_base_coverageFalse.txt
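(As a sanity check on the marking step itself, the library duplication rate can be read straight out of the METRICS_FILE; a minimal sketch, assuming the standard Picard metrics layout:)
# print PERCENT_DUPLICATION from the MarkDuplicates metrics file (header row + first data row)
grep -v '^#' R3_STATS/XXXXHyper_1_C_50.7X_markDuplicatesFalse.txt | grep -v '^$' | head -2 | \
  awk -F'\t' 'NR==1{for(i=1;i<=NF;i++) if($i=="PERCENT_DUPLICATION")c=i} NR==2{print $c}'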
For collecting metrics (exact commands pulled from the pipeline log):
(base) [nowackj1@ridus004 seqcap19-TWIST]$ grep -P "^#" gatkPost.log | grep Hs
#gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
#java -Xmx20g -jar picard.jar CalculateHsMetrics VALIDATION_STRINGENCY=LENIENT LEVEL=ALL_READS TMP_DIR=/data1/BIOINFORMATICS/TEMP/ INPUT=R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam OUTPUT=GATK4/GATK3-removeDupFalse.txt REFERENCE_SEQUENCE=/data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa BAIT_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list BAIT_SET_NAME=capture TARGET_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list PER_TARGET_COVERAGE=GATK4/GATK3-removeDupFalse.cvg.txt
#gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
#java -Xmx20g -jar picard.jar CalculateHsMetrics VALIDATION_STRINGENCY=LENIENT LEVEL=ALL_READS TMP_DIR=/data1/BIOINFORMATICS/TEMP/ INPUT=R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam OUTPUT=GATK4/GATK3-removeDupTrue.txt REFERENCE_SEQUENCE=/data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa BAIT_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list BAIT_SET_NAME=capture TARGET_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list PER_TARGET_COVERAGE=GATK4/GATK3-removeDupTrue.cvg.txt
c) The entire error log if applicable.
#gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
18:44:23.344 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jun 29 18:44:23 EDT 2020] CollectHsMetrics --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --COVERAGE_CAP 100000 --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Jun 29, 2020 6:44:23 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Mon Jun 29 18:44:23 EDT 2020] Executing as server-name on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.7.0
WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam
INFO 2020-06-29 18:44:38 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 10s. Last read position: chr1:109,250,541
INFO 2020-06-29 18:44:48 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 9s. Last read position: chr1:227,845,674
INFO 2020-06-29 18:45:00 CollectHsMetrics Processed 3,000,000 records. Elapsed time: 00:00:32s. Time for last 1,000,000: 12s. Last read position: chr2:189,006,940
INFO 2020-06-29 18:45:15 CollectHsMetrics Processed 4,000,000 records. Elapsed time: 00:00:47s. Time for last 1,000,000: 14s. Last read position: chr3:105,681,279
INFO 2020-06-29 18:45:26 CollectHsMetrics Processed 5,000,000 records. Elapsed time: 00:00:58s. Time for last 1,000,000: 11s. Last read position: chr4:147,822,923
INFO 2020-06-29 18:45:39 CollectHsMetrics Processed 6,000,000 records. Elapsed time: 00:01:11s. Time for last 1,000,000: 12s. Last read position: chr6:24,667,266
INFO 2020-06-29 18:45:51 CollectHsMetrics Processed 7,000,000 records. Elapsed time: 00:01:23s. Time for last 1,000,000: 11s. Last read position: chr7:1,448,633
INFO 2020-06-29 18:46:00 CollectHsMetrics Processed 8,000,000 records. Elapsed time: 00:01:32s. Time for last 1,000,000: 9s. Last read position: chr8:7,741,143
INFO 2020-06-29 18:46:10 CollectHsMetrics Processed 9,000,000 records. Elapsed time: 00:01:42s. Time for last 1,000,000: 10s. Last read position: chr9:104,617,636
INFO 2020-06-29 18:46:20 CollectHsMetrics Processed 10,000,000 records. Elapsed time: 00:01:52s. Time for last 1,000,000: 9s. Last read position: chr10:102,147,187
INFO 2020-06-29 18:46:30 CollectHsMetrics Processed 11,000,000 records. Elapsed time: 00:02:02s. Time for last 1,000,000: 9s. Last read position: chr11:67,424,094
INFO 2020-06-29 18:46:39 CollectHsMetrics Processed 12,000,000 records. Elapsed time: 00:02:11s. Time for last 1,000,000: 9s. Last read position: chr12:57,198,559
INFO 2020-06-29 18:46:50 CollectHsMetrics Processed 13,000,000 records. Elapsed time: 00:02:21s. Time for last 1,000,000: 10s. Last read position: chr14:67,561,902
INFO 2020-06-29 18:46:59 CollectHsMetrics Processed 14,000,000 records. Elapsed time: 00:02:31s. Time for last 1,000,000: 9s. Last read position: chr15:89,155,418
INFO 2020-06-29 18:47:08 CollectHsMetrics Processed 15,000,000 records. Elapsed time: 00:02:40s. Time for last 1,000,000: 8s. Last read position: chr16:70,832,841
INFO 2020-06-29 18:47:17 CollectHsMetrics Processed 16,000,000 records. Elapsed time: 00:02:49s. Time for last 1,000,000: 9s. Last read position: chr17:42,698,878
INFO 2020-06-29 18:47:26 CollectHsMetrics Processed 17,000,000 records. Elapsed time: 00:02:58s. Time for last 1,000,000: 9s. Last read position: chr19:2,477,058
INFO 2020-06-29 18:47:36 CollectHsMetrics Processed 18,000,000 records. Elapsed time: 00:03:08s. Time for last 1,000,000: 9s. Last read position: chr19:48,195,814
INFO 2020-06-29 18:47:45 CollectHsMetrics Processed 19,000,000 records. Elapsed time: 00:03:17s. Time for last 1,000,000: 9s. Last read position: chr21:34,521,461
INFO 2020-06-29 18:47:54 CollectHsMetrics Processed 20,000,000 records. Elapsed time: 00:03:26s. Time for last 1,000,000: 9s. Last read position: chrX:71,663,261
INFO 2020-06-29 18:47:58 TheoreticalSensitivity Creating Roulette Wheel
INFO 2020-06-29 18:47:58 TheoreticalSensitivity Calculating quality sums from quality sampler
INFO 2020-06-29 18:47:58 TheoreticalSensitivity 0 sampling iterations completed
INFO 2020-06-29 18:48:03 TheoreticalSensitivity 1000 sampling iterations completed
INFO 2020-06-29 18:48:08 TheoreticalSensitivity 2000 sampling iterations completed
INFO 2020-06-29 18:48:13 TheoreticalSensitivity 3000 sampling iterations completed
INFO 2020-06-29 18:48:19 TheoreticalSensitivity 4000 sampling iterations completed
INFO 2020-06-29 18:48:24 TheoreticalSensitivity 5000 sampling iterations completed
INFO 2020-06-29 18:48:29 TheoreticalSensitivity 6000 sampling iterations completed
INFO 2020-06-29 18:48:34 TheoreticalSensitivity 7000 sampling iterations completed
INFO 2020-06-29 18:48:39 TheoreticalSensitivity 8000 sampling iterations completed
INFO 2020-06-29 18:48:45 TheoreticalSensitivity 9000 sampling iterations completed
INFO 2020-06-29 18:48:50 TheoreticalSensitivity Calculating theoretical het sensitivity
INFO 2020-06-29 18:48:52 TargetMetricsCollector Calculating GC metrics
[Mon Jun 29 18:48:52 EDT 2020] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 4.49 minutes.
Runtime.totalMemory()=3510632448
Tool returned:
0
Using GATK jar /data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
#
#
#-------------------------------------------------------------------
#
#
#java -Xmx20g -jar picard.jar CalculateHsMetrics VALIDATION_STRINGENCY=LENIENT LEVEL=ALL_READS TMP_DIR=/data1/BIOINFORMATICS/TEMP/ INPUT=R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam OUTPUT=GATK4/GATK3-removeDupFalse.txt REFERENCE_SEQUENCE=/data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa BAIT_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list BAIT_SET_NAME=capture TARGET_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list PER_TARGET_COVERAGE=GATK4/GATK3-removeDupFalse.cvg.txt
[Mon Jun 29 18:48:53 EDT 2020] picard.analysis.directed.CalculateHsMetrics BAIT_INTERVALS=[R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list] BAIT_SET_NAME=capture TARGET_INTERVALS=[R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list] INPUT=R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam OUTPUT=GATK4/GATK3-removeDupFalse.txt METRIC_ACCUMULATION_LEVEL=[ALL_READS] PER_TARGET_COVERAGE=GATK4/GATK3-removeDupFalse.cvg.txt TMP_DIR=[/data1/BIOINFORMATICS/TEMP] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=/data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Jun 29 18:48:53 EDT 2020] Executing as server-name on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Picard version: 1.134() JdkDeflater
WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam
INFO 2020-06-29 18:49:00 CalculateHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:03s. Time for last 1,000,000: 3s. Last read position: chr1:109,250,541
INFO 2020-06-29 18:49:03 CalculateHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:06s. Time for last 1,000,000: 3s. Last read position: chr1:227,845,674
INFO 2020-06-29 18:49:06 CalculateHsMetrics Processed 3,000,000 records. Elapsed time: 00:00:08s. Time for last 1,000,000: 2s. Last read position: chr2:189,006,940
INFO 2020-06-29 18:49:09 CalculateHsMetrics Processed 4,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 3s. Last read position: chr3:105,681,279
INFO 2020-06-29 18:49:12 CalculateHsMetrics Processed 5,000,000 records. Elapsed time: 00:00:15s. Time for last 1,000,000: 3s. Last read position: chr4:147,822,923
INFO 2020-06-29 18:49:14 CalculateHsMetrics Processed 6,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 2s. Last read position: chr6:24,667,266
INFO 2020-06-29 18:49:18 CalculateHsMetrics Processed 7,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 3s. Last read position: chr7:1,448,633
INFO 2020-06-29 18:49:20 CalculateHsMetrics Processed 8,000,000 records. Elapsed time: 00:00:23s. Time for last 1,000,000: 2s. Last read position: chr8:7,741,143
INFO 2020-06-29 18:49:22 CalculateHsMetrics Processed 9,000,000 records. Elapsed time: 00:00:25s. Time for last 1,000,000: 2s. Last read position: chr9:104,617,636
INFO 2020-06-29 18:49:25 CalculateHsMetrics Processed 10,000,000 records. Elapsed time: 00:00:28s. Time for last 1,000,000: 3s. Last read position: chr10:102,147,187
INFO 2020-06-29 18:49:29 CalculateHsMetrics Processed 11,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 3s. Last read position: chr11:67,424,094
INFO 2020-06-29 18:49:32 CalculateHsMetrics Processed 12,000,000 records. Elapsed time: 00:00:34s. Time for last 1,000,000: 2s. Last read position: chr12:57,198,559
INFO 2020-06-29 18:49:34 CalculateHsMetrics Processed 13,000,000 records. Elapsed time: 00:00:37s. Time for last 1,000,000: 2s. Last read position: chr14:67,561,902
INFO 2020-06-29 18:49:36 CalculateHsMetrics Processed 14,000,000 records. Elapsed time: 00:00:39s. Time for last 1,000,000: 2s. Last read position: chr15:89,155,418
INFO 2020-06-29 18:49:38 CalculateHsMetrics Processed 15,000,000 records. Elapsed time: 00:00:41s. Time for last 1,000,000: 2s. Last read position: chr16:70,832,841
INFO 2020-06-29 18:49:41 CalculateHsMetrics Processed 16,000,000 records. Elapsed time: 00:00:44s. Time for last 1,000,000: 2s. Last read position: chr17:42,698,878
INFO 2020-06-29 18:49:44 CalculateHsMetrics Processed 17,000,000 records. Elapsed time: 00:00:47s. Time for last 1,000,000: 2s. Last read position: chr19:2,477,058
INFO 2020-06-29 18:49:46 CalculateHsMetrics Processed 18,000,000 records. Elapsed time: 00:00:49s. Time for last 1,000,000: 2s. Last read position: chr19:48,195,814
INFO 2020-06-29 18:49:49 CalculateHsMetrics Processed 19,000,000 records. Elapsed time: 00:00:51s. Time for last 1,000,000: 2s. Last read position: chr21:34,521,461
INFO 2020-06-29 18:49:51 CalculateHsMetrics Processed 20,000,000 records. Elapsed time: 00:00:54s. Time for last 1,000,000: 2s. Last read position: chrX:71,663,261
INFO 2020-06-29 18:49:52 TargetMetricsCollector Calculating GC metrics
[Mon Jun 29 18:49:53 EDT 2020] picard.analysis.directed.CalculateHsMetrics done. Elapsed time: 0.99 minutes.
Runtime.totalMemory()=8103395328
#
#
#-------------------------------------------------------------------
#
#
#gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
18:49:56.671 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jun 29 18:49:56 EDT 2020] CollectHsMetrics --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --COVERAGE_CAP 100000 --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Jun 29, 2020 6:49:56 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Mon Jun 29 18:49:56 EDT 2020] Executing as server-name on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.7.0
WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam
INFO 2020-06-29 18:50:13 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 12s. Last read position: chr1:109,491,321
INFO 2020-06-29 18:50:26 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:25s. Time for last 1,000,000: 12s. Last read position: chr1:228,372,652
INFO 2020-06-29 18:50:40 CollectHsMetrics Processed 3,000,000 records. Elapsed time: 00:00:39s. Time for last 1,000,000: 14s. Last read position: chr2:200,636,966
INFO 2020-06-29 18:50:54 CollectHsMetrics Processed 4,000,000 records. Elapsed time: 00:00:53s. Time for last 1,000,000: 14s. Last read position: chr3:119,618,061
INFO 2020-06-29 18:51:09 CollectHsMetrics Processed 5,000,000 records. Elapsed time: 00:01:08s. Time for last 1,000,000: 14s. Last read position: chr4:175,640,181
INFO 2020-06-29 18:51:21 CollectHsMetrics Processed 6,000,000 records. Elapsed time: 00:01:20s. Time for last 1,000,000: 12s. Last read position: chr6:30,075,334
INFO 2020-06-29 18:51:33 CollectHsMetrics Processed 7,000,000 records. Elapsed time: 00:01:32s. Time for last 1,000,000: 11s. Last read position: chr7:10,974,255
INFO 2020-06-29 18:51:42 CollectHsMetrics Processed 8,000,000 records. Elapsed time: 00:01:41s. Time for last 1,000,000: 9s. Last read position: chr8:22,690,816
INFO 2020-06-29 18:51:53 CollectHsMetrics Processed 9,000,000 records. Elapsed time: 00:01:52s. Time for last 1,000,000: 10s. Last read position: chr9:121,141,739
INFO 2020-06-29 18:52:02 CollectHsMetrics Processed 10,000,000 records. Elapsed time: 00:02:01s. Time for last 1,000,000: 9s. Last read position: chr10:122,227,829
INFO 2020-06-29 18:52:12 CollectHsMetrics Processed 11,000,000 records. Elapsed time: 00:02:11s. Time for last 1,000,000: 10s. Last read position: chr11:76,661,433
INFO 2020-06-29 18:52:23 CollectHsMetrics Processed 12,000,000 records. Elapsed time: 00:02:21s. Time for last 1,000,000: 10s. Last read position: chr12:101,766,225
INFO 2020-06-29 18:52:32 CollectHsMetrics Processed 13,000,000 records. Elapsed time: 00:02:31s. Time for last 1,000,000: 9s. Last read position: chr14:91,289,246
INFO 2020-06-29 18:52:42 CollectHsMetrics Processed 14,000,000 records. Elapsed time: 00:02:41s. Time for last 1,000,000: 9s. Last read position: chr16:1,660,002
INFO 2020-06-29 18:52:51 CollectHsMetrics Processed 15,000,000 records. Elapsed time: 00:02:50s. Time for last 1,000,000: 9s. Last read position: chr16:89,553,778
INFO 2020-06-29 18:53:01 CollectHsMetrics Processed 16,000,000 records. Elapsed time: 00:02:59s. Time for last 1,000,000: 9s. Last read position: chr17:50,145,070
INFO 2020-06-29 18:53:10 CollectHsMetrics Processed 17,000,000 records. Elapsed time: 00:03:09s. Time for last 1,000,000: 9s. Last read position: chr19:8,490,634
INFO 2020-06-29 18:53:19 CollectHsMetrics Processed 18,000,000 records. Elapsed time: 00:03:18s. Time for last 1,000,000: 9s. Last read position: chr19:54,222,355
INFO 2020-06-29 18:53:29 CollectHsMetrics Processed 19,000,000 records. Elapsed time: 00:03:28s. Time for last 1,000,000: 9s. Last read position: chr22:20,142,823
INFO 2020-06-29 18:53:39 CollectHsMetrics Processed 20,000,000 records. Elapsed time: 00:03:38s. Time for last 1,000,000: 9s. Last read position: chrX:154,354,234
INFO 2020-06-29 18:53:40 TheoreticalSensitivity Creating Roulette Wheel
INFO 2020-06-29 18:53:40 TheoreticalSensitivity Calculating quality sums from quality sampler
INFO 2020-06-29 18:53:40 TheoreticalSensitivity 0 sampling iterations completed
INFO 2020-06-29 18:53:46 TheoreticalSensitivity 1000 sampling iterations completed
INFO 2020-06-29 18:53:51 TheoreticalSensitivity 2000 sampling iterations completed
INFO 2020-06-29 18:53:56 TheoreticalSensitivity 3000 sampling iterations completed
INFO 2020-06-29 18:54:01 TheoreticalSensitivity 4000 sampling iterations completed
INFO 2020-06-29 18:54:06 TheoreticalSensitivity 5000 sampling iterations completed
INFO 2020-06-29 18:54:11 TheoreticalSensitivity 6000 sampling iterations completed
INFO 2020-06-29 18:54:16 TheoreticalSensitivity 7000 sampling iterations completed
INFO 2020-06-29 18:54:21 TheoreticalSensitivity 8000 sampling iterations completed
INFO 2020-06-29 18:54:27 TheoreticalSensitivity 9000 sampling iterations completed
INFO 2020-06-29 18:54:32 TheoreticalSensitivity Calculating theoretical het sensitivity
INFO 2020-06-29 18:54:34 TargetMetricsCollector Calculating GC metrics
[Mon Jun 29 18:54:34 EDT 2020] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 4.63 minutes.
Runtime.totalMemory()=3455057920
Tool returned:
0
Using GATK jar /data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/GATK4-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
#
#
#-------------------------------------------------------------------
#
#
#java -Xmx20g -jar picard.jar CalculateHsMetrics VALIDATION_STRINGENCY=LENIENT LEVEL=ALL_READS TMP_DIR=/data1/BIOINFORMATICS/TEMP/ INPUT=R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam OUTPUT=GATK4/GATK3-removeDupTrue.txt REFERENCE_SEQUENCE=/data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa BAIT_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list BAIT_SET_NAME=capture TARGET_INTERVALS=R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list PER_TARGET_COVERAGE=GATK4/GATK3-removeDupTrue.cvg.txt
[Mon Jun 29 18:54:35 EDT 2020] picard.analysis.directed.CalculateHsMetrics BAIT_INTERVALS=[R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list] BAIT_SET_NAME=capture TARGET_INTERVALS=[R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list] INPUT=R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam OUTPUT=GATK4/GATK3-removeDupTrue.txt METRIC_ACCUMULATION_LEVEL=[ALL_READS] PER_TARGET_COVERAGE=GATK4/GATK3-removeDupTrue.cvg.txt TMP_DIR=[/data1/BIOINFORMATICS/TEMP] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=/data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Jun 29 18:54:35 EDT 2020] Executing as server-name on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Picard version: 1.134() JdkDeflater
WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-XXXX/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam
INFO 2020-06-29 18:54:41 CalculateHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:03s. Time for last 1,000,000: 3s. Last read position: chr1:109,491,321
INFO 2020-06-29 18:54:45 CalculateHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:06s. Time for last 1,000,000: 3s. Last read position: chr1:228,372,652
INFO 2020-06-29 18:54:47 CalculateHsMetrics Processed 3,000,000 records. Elapsed time: 00:00:08s. Time for last 1,000,000: 2s. Last read position: chr2:200,636,966
INFO 2020-06-29 18:54:50 CalculateHsMetrics Processed 4,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 2s. Last read position: chr3:119,618,061
INFO 2020-06-29 18:54:53 CalculateHsMetrics Processed 5,000,000 records. Elapsed time: 00:00:14s. Time for last 1,000,000: 3s. Last read position: chr4:175,640,181
INFO 2020-06-29 18:54:56 CalculateHsMetrics Processed 6,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 2s. Last read position: chr6:30,075,334
INFO 2020-06-29 18:54:59 CalculateHsMetrics Processed 7,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 3s. Last read position: chr7:10,974,255
INFO 2020-06-29 18:55:01 CalculateHsMetrics Processed 8,000,000 records. Elapsed time: 00:00:22s. Time for last 1,000,000: 2s. Last read position: chr8:22,690,816
INFO 2020-06-29 18:55:03 CalculateHsMetrics Processed 9,000,000 records. Elapsed time: 00:00:25s. Time for last 1,000,000: 2s. Last read position: chr9:121,141,739
INFO 2020-06-29 18:55:06 CalculateHsMetrics Processed 10,000,000 records. Elapsed time: 00:00:28s. Time for last 1,000,000: 3s. Last read position: chr10:122,227,829
INFO 2020-06-29 18:55:10 CalculateHsMetrics Processed 11,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 3s. Last read position: chr11:76,661,433
INFO 2020-06-29 18:55:12 CalculateHsMetrics Processed 12,000,000 records. Elapsed time: 00:00:33s. Time for last 1,000,000: 2s. Last read position: chr12:101,766,225
INFO 2020-06-29 18:55:15 CalculateHsMetrics Processed 13,000,000 records. Elapsed time: 00:00:36s. Time for last 1,000,000: 2s. Last read position: chr14:91,289,246
INFO 2020-06-29 18:55:17 CalculateHsMetrics Processed 14,000,000 records. Elapsed time: 00:00:38s. Time for last 1,000,000: 2s. Last read position: chr16:1,660,002
INFO 2020-06-29 18:55:19 CalculateHsMetrics Processed 15,000,000 records. Elapsed time: 00:00:41s. Time for last 1,000,000: 2s. Last read position: chr16:89,553,778
INFO 2020-06-29 18:55:22 CalculateHsMetrics Processed 16,000,000 records. Elapsed time: 00:00:43s. Time for last 1,000,000: 2s. Last read position: chr17:50,145,070
INFO 2020-06-29 18:55:25 CalculateHsMetrics Processed 17,000,000 records. Elapsed time: 00:00:46s. Time for last 1,000,000: 2s. Last read position: chr19:8,490,634
INFO 2020-06-29 18:55:27 CalculateHsMetrics Processed 18,000,000 records. Elapsed time: 00:00:48s. Time for last 1,000,000: 2s. Last read position: chr19:54,222,355
INFO 2020-06-29 18:55:29 CalculateHsMetrics Processed 19,000,000 records. Elapsed time: 00:00:51s. Time for last 1,000,000: 2s. Last read position: chr22:20,142,823
INFO 2020-06-29 18:55:32 CalculateHsMetrics Processed 20,000,000 records. Elapsed time: 00:00:53s. Time for last 1,000,000: 2s. Last read position: chrX:154,354,234
INFO 2020-06-29 18:55:33 TargetMetricsCollector Calculating GC metrics
[Mon Jun 29 18:55:33 EDT 2020] picard.analysis.directed.CalculateHsMetrics done. Elapsed time: 0.97 minutes.
Runtime.totalMemory()=8089763840
#
-
Hi JonR, thank you for the question. Here is the link to the documentation for MarkDuplicates.
The default for --REMOVE_DUPLICATES is false because GATK prioritizes keeping more information in case it is needed later in your analysis. When you set --REMOVE_DUPLICATES to true, you discard the duplicate reads, so it makes sense that PCT_TARGET_BASES_20X changes.
We cannot troubleshoot or test GATK3 because it is no longer supported, so we cannot say why PCT_TARGET_BASES_20X did not change when you removed duplicates.
As for which metric is preferred, it depends on your use case. You can keep --REMOVE_DUPLICATES as false, since that is the default, but the right choice depends on how you will use your data and whether you will be working primarily in GATK or in other platforms.
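(One way to verify what each BAM actually contains is to count reads carrying the duplicate flag with samtools; a minimal sketch using the paths from the commands above. With --REMOVE_DUPLICATES false the count should be greater than zero; with true it should be zero.)
samtools view -c -f 0x400 R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam
samtools view -c -f 0x400 R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam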
-
Hello Genevieve,
Then compare gatk4-4.0.0.0-0 vs. the newest version. At 50X I'm getting 72% vs. 6% coverage on one of my datasets; that's a 66-percentage-point difference.
-
Hi JonR, could you please give me the specific commands you used to get the 72% and 6% coverage?
-
gatk4-4.0.0.0-0
gatk-launch CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/HG02026_HPR_Rep2_Cap04_70Me_30M.bam --OUTPUT GATK4/GATK400-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
gatk4-4.1.7.0-0
gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/HG02026_HPR_Rep2_Cap04_70Me_30M_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/HG02026_HPR_Rep2_Cap04_70Me_30M.bam --OUTPUT GATK4/GATK4-removeDupTrue.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
-
For PCT_TARGET_BASES_50X
gatk4-4.0.0.0-0
0.724252
gatk4-4.1.7.0-0
0.068928
This is with Picard REMOVE_DUPLICATES set to true; there doesn't seem to be an issue with false.
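(To see every metric that shifted between the two versions, not just PCT_TARGET_BASES_50X, the two outputs can be diffed as name/value pairs; a sketch assuming the output names from the commands above and the standard Picard metrics layout:)
# dump a metrics file as one "NAME<tab>VALUE" pair per line, then diff the two versions
dump() { grep -v '^#' "$1" | grep -v '^$' | head -2 | \
  awk -F'\t' 'NR==1{split($0,h)} NR==2{for(i=1;i<=NF;i++) print h[i]"\t"$i}'; }
diff <(dump GATK4/GATK400-removeDupTrue.txt) <(dump GATK4/GATK4-removeDupTrue.txt)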
-
Hi JonR, MarkDuplicates tags the duplicates, and CollectHsMetrics ignores reads flagged as duplicates, so you should be getting the same coverage with GATK 4.1.8.0 whether remove-duplicates is true or false. Please test this out and confirm.
Regarding GATK 4.0.0.0-0, that is a very old version of GATK4 and we do not recommend using it for results or for comparison, since it may not be flagging duplicates correctly.
Let us know if you get the same coverage with remove duplicates true and false on the same version of GATK. When requiring 50X, it would be reasonable to see low coverage.
-
I just updated GATK4 to the newest version, 4.1.8, and ran it. These are the results I got with CollectHsMetrics.
Keep in mind the BAM files were still created with version 4.0.0.0-0. I can redo the entire pipeline with the newest GATK if you want.
#GATK-4 FALSE
- FOLD_80_BASE_PENALTY 4.981967
- PCT_TARGET_BASES_20X 0.611954
#GATK-4 TRUE
- FOLD_80_BASE_PENALTY 0.85497
- PCT_TARGET_BASES_20X 0.407856
-
Could you send me the complete commands you are using for MarkDuplicates and CollectHsMetrics?
-
CollectHsMetrics command FALSE
# gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
# =================================================================
16:08:12.575 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jul 20 16:08:12 EDT 2020] CollectHsMetrics --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --COVERAGE_CAP 100000 --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Jul 20, 2020 4:08:12 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Mon Jul 20 16:08:12 EDT 2020] Executing as XXXX.com on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.0
WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam
INFO 2020-07-20 16:08:36 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 12s. Last read position: chr1:109,250,541
INFO 2020-07-20 16:08:46 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:21s. Time for last 1,000,000: 9s. Last read position: chr1:227,845,674
CollectHsMetrics command TRUE
gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metrics.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000 --PER_BASE_COVERAGE GATK4/XXXXHyper_1_C_50.7X_per_base_coverage.txt
# =================================================================
16:08:49.344 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk-newest/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jul 20 16:08:49 EDT 2020] CollectHsMetrics --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/XXXXHyper_1_C_50.7X_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --OUTPUT GATK4/XXXXHyper_1_C_50.7X_hs_metrics.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --PER_BASE_COVERAGE GATK4/XXXXHyper_1_C_50.7X_per_base_coverage.txt --COVERAGE_CAP 100000 --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE /data/BIOINFORMATICS/REFERENCES/HG38_VALIDATION/hg38.fa --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Jul 20, 2020 4:08:49 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Mon Jul 20 16:08:49 EDT 2020] Executing as XXXX.com on Linux 3.10.0-1062.1.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.0
WARNING: BAM index file /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bai is older than BAM /data/BIOINFORMATICS/PROJECT_PROD_JN/1910_NGHC-Benchmark_NGHC1_v3/SEQUENOMICS/seqcap19-TWIST/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam
INFO 2020-07-20 16:09:06 CollectHsMetrics Processed 1,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 12s. Last read position: chr1:109,491,321
INFO 2020-07-20 16:09:17 CollectHsMetrics Processed 2,000,000 records. Elapsed time: 00:00:22s. Time for last 1,000,000: 10s. Last read position: chr1:228,372,652
Marking Duplicates with 4.0.0.0-0 FALSE
gatk-launch MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/XXXXHyper_1_C_50.7X.bam -O R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X.bam --METRICS_FILE R3_STATS/XXXXHyper_1_C_50.7X_markDuplicatesFalse.txt --REMOVE_DUPLICATES false --ASSUME_SORTED true --CREATE_INDEX true > LOGS/R3q_REMOVE_DUPLICATES_FALSE/XXXXHyper_1_C_50.7X__markDuplicatesFalse.log 2>&1
valdationAgainstSogloffRUn-1C.log
Marking Duplicates with 4.0.0.0-0 TRUE
gatk-launch MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/XXXXHyper_1_C_50.7X.bam -O R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X.bam --METRICS_FILE R3_STATS/XXXXHyper_1_C_50.7X_markDuplicatesTrue.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true > LOGS/R3j_REMOVE_DUPLICATES/XXXXHyper_1_C_50.7X__markDuplicatesTrue.log 2>&1
-
JonR, are you running MarkDuplicates with 4.0.0.0-0? Please run it with the newest version of GATK. If you can run the whole pipeline with the newest version of GATK, that would be the most helpful.
-
Every part of this pipeline was done with the newest version of GATK4. I installed it today.
I started from a SAM file, and every operation was run today.
#GATK-4 FALSE
- FOLD_80_BASE_PENALTY 4.981967
- PCT_TARGET_BASES_20X 0.611954
#GATK-4 TRUE
- FOLD_80_BASE_PENALTY 0.85497
- PCT_TARGET_BASES_20X 0.407856
-
My commands
- mkdir -p R3e_MAP_READS
- mkdir -p R3f_SAM_TO_BAM
- mkdir -p R3g_FIXMATE
- mkdir -p R3n_BED_TO_INTERVAL_LIST
- mkdir -p R3h_SORT
- mkdir -p R3q_REMOVE_DUPLICATES_FALSE
- mkdir -p R3j_REMOVE_DUPLICATES
- mkdir -p R3_STATS
- gatk BedToIntervalList --TMP_DIR TEMP --INPUT BED/TEST.bed --SEQUENCE_DICTIONARY $DICT --OUTPUT R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_target.interval_list
- gatk BedToIntervalList --TMP_DIR TEMP --INPUT BED/TEST.bed --SEQUENCE_DICTIONARY $DICT --OUTPUT R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_bait.interval_list
- gatk SamFormatConverter -I R3e_MAP_READS/$SAMPLE\.sam -O R3f_SAM_TO_BAM/$SAMPLE\.bam
- rm R3e_MAP_READS/$SAMPLE\.sam
- gatk FixMateInformation --TMP_DIR TEMP -I R3f_SAM_TO_BAM/$SAMPLE\.bam -O R3g_FIXMATE/$SAMPLE\.bam
- gatk SortSam --TMP_DIR TEMP -I R3g_FIXMATE/$SAMPLE\.bam -O R3h_SORT/$SAMPLE\.bam -SO coordinate
- gatk MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/$SAMPLE\.bam -O R3q_REMOVE_DUPLICATES_FALSE/$SAMPLE\.bam --METRICS_FILE R3_STATS/$SAMPLE\_markDuplicatesFalse.txt --REMOVE_DUPLICATES false --ASSUME_SORTED true --CREATE_INDEX true
- gatk MarkDuplicates --TMP_DIR TEMP --VALIDATION_STRINGENCY LENIENT -I R3h_SORT/$SAMPLE\.bam -O R3j_REMOVE_DUPLICATES/$SAMPLE\.bam --METRICS_FILE R3_STATS/$SAMPLE\_markDuplicatesTrue.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true
- gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_target.interval_list --INPUT R3q_REMOVE_DUPLICATES_FALSE/$SAMPLE\.bam --OUTPUT GATK4/$SAMPLE\_hs_metricsFalse.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE $REF -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000
- gatk CollectHsMetrics --TMP_DIR TEMP --BAIT_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_bait.interval_list --BAIT_SET_NAME DESIGN --TARGET_INTERVALS R3n_BED_TO_INTERVAL_LIST/$SAMPLE\_target.interval_list --INPUT R3j_REMOVE_DUPLICATES/$SAMPLE\.bam --OUTPUT GATK4/$SAMPLE\_hs_metrics.txt --METRIC_ACCUMULATION_LEVEL ALL_READS --REFERENCE_SEQUENCE $REF -VALIDATION_STRINGENCY LENIENT --COVERAGE_CAP 100000 --PER_BASE_COVERAGE GATK4/$SAMPLE\_per_base_coverage.txt
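As a rough cross-check of PCT_TARGET_BASES_20X outside Picard, the fraction can be recomputed with samtools (a sketch: it will not reproduce Picard's number exactly, since filtering details such as overlapping-mate clipping differ, but it should land in the same ballpark; assumes a non-overlapping BED and the paths and variables used above):
TOTAL=$(awk '{t+=$3-$2} END{print t}' BED/TEST.bed)   # total target size in bases
AT20X=$(samtools depth -a -b BED/TEST.bed -q 20 -Q 20 R3j_REMOVE_DUPLICATES/"$SAMPLE".bam | awk '$3>=20' | wc -l)   # target positions at >=20X
echo "scale=6; $AT20X/$TOTAL" | bc   # fraction of target bases covered at >=20X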
-
Hi JonR, thanks for the clarification. Would you be able to follow these instructions and upload the files you are using so we can test it on our end?
-
OK, file uploaded.
GATKBUG-Brandt-JonR.tar.gz
Check out the README contained within. Simply activate the GATK conda environment and run the script, and you should have all the files created from the original BAM.
-
Thank you for uploading those files, JonR, we will look into it and get back to you.
-
Hi JonR, thank you for your patience as we look into this. I spoke with a Picard developer and found that I previously misspoke: FOLD_80_BASE_PENALTY and PCT_TARGET_BASES_20X do not ignore duplicates; the coverage metrics that do ignore duplicates have "UNIQUE" or "UQ" in their names. That explains why you are getting different coverage values with --REMOVE_DUPLICATES true and false.
Would you also like us to look into these metrics changing across the newer GATK versions? If so, please upload all the input files for MarkDuplicates and CollectHsMetrics, so that we do not have to run any of the other steps. Please also include the sample coverage output files for all of these tests so that I can see the final output on your end.
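(To see which columns in your HsMetrics output are the duplicate-excluding ones, a one-line sketch over the header row of one of your existing outputs:)
grep -v '^#' GATK4/XXXXHyper_1_C_50.7X_hs_metricsFalse.txt | grep -v '^$' | head -1 | tr '\t' '\n' | grep -E 'UQ|UNIQUE'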