Reproduce GATK Tutorial#11682 leading to different results
REQUIRED for all errors and issues:
a) GATK version used: docker image spacecade7/tutorial_11682_11683:gatk4.0.1.1
b) Exact command used:
tutorial1=./gatk_CNV/tutorial/tutorial_11682
tutorial2=./gatk_CNV/tutorial/tutorial_11683
out=./gatk_CNV/tutorial/out
# input files
interval=$tutorial1/targets_C.interval_list
REF=./gatk_CNV/tutorial/Homo_sapiens_assembly38.fasta
REF_dict=./gatk_CNV/tutorial/tutorial_11682/Homo_sapiens_assembly38.dict
BAM=$tutorial1/tumor.bam
SAMPLE="test"
normal1=$tutorial1/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.counts.hdf5
normal2=$tutorial1/HG00733.alt_bwamem_GRCh38DH.20150826.PUR.exome.counts.hdf5
normal3=$tutorial1/NA19654.alt_bwamem_GRCh38DH.20150826.MXL.exome.counts.hdf5
# 1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
$gatk_path PreprocessIntervals \
-L$interval\
-R$REF\
--bin-length0\
--interval-merging-ruleOVERLAPPING_ONLY\
-O$out/$SAMPLE.preprocessed.interval_list
$gatk_path CollectFragmentCounts \
-I$BAM\
-L$out/$SAMPLE.preprocessed.interval_list\
--interval-merging-ruleOVERLAPPING_ONLY\
-O$out/$SAMPLE.counts.hdf5
# 2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
$gatk_path --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
-I$normal1\
-I$normal2\
-I$normal3\
--minimum-interval-median-percentile5.0\
-O$out/cnvponC.pon.hdf5
# 3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
# Regularly, we will use the output from CollectFragmentCounts step
# out_tumor_count=$out/test.counts.hdf5
out_tumor_count=$tutorial1/hcc1143_T_clean.counts.hdf5
# Use tutorial files
# $gatk_path --java-options "-Xmx12g" DenoiseReadCounts \
# -I $out_tumor_count \
# # This step seems to make the difference; self-generated pon or tutorial pon
# --count-panel-of-normals $out/cnvponC.pon.hdf5 \
# --standardized-copy-ratios $tutorial1/hcc1143_T_clean.standardizedCR.tsv \
# --denoised-copy-ratios $tutorial1/hcc1143_T_clean.denoisedCR.tsv
# Use pipeline produced files
$gatk_path --java-options "-Xmx12g" DenoiseReadCounts \
-I$out_tumor_count\
--count-panel-of-normals$out/cnvponC.pon.hdf5\
--standardized-copy-ratios$out/hcc1143_T_clean.standardizedCR.tsv\
--denoised-copy-ratios$out/hcc1143_T_clean.denoisedCR.tsv
# 4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
mkdir -p $out/plots
# # Use tutorial files
# $gatk_path PlotDenoisedCopyRatios \
# --standardized-copy-ratios $tutorial1/hcc1143_T_clean.standardizedCR.tsv \
# --denoised-copy-ratios $tutorial1/hcc1143_T_clean.denoisedCR.tsv \
# --sequence-dictionary $REF_dict \
# --minimum-contig-length 46709983 \
# --output $out/plots \
# --output-prefix hcc1143_T_clean
$gatk_path PlotDenoisedCopyRatios \
--standardized-copy-ratios$out/hcc1143_T_clean.standardizedCR.tsv\
--denoised-copy-ratios$out/hcc1143_T_clean.denoisedCR.tsv\
--sequence-dictionary$REF_dict\
--minimum-contig-length46709983\
--output$out/plots\
--output-prefixhcc1143_T_clean
c) Entire program log:
Brief issue description:
When following the tutorial https://gatk.broadinstitute.org/hc/en-us/articles/360035531092--How-to-part-I-Sensitively-detect-copy-ratio-alterations-and-allelic-segments, the #4 Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios have different results than the tutorial. Through the control vectors test, it seems that the samples that are used in step #2 to generate CNV PON used in the tutorial are different from the files stored in the tutorial. It could also be step #3 DenoiseReadCounts produced different values inside hcc1143_T_clean.standardizedCR.tsv
and hcc1143_T_clean.denoisedCR.tsv
compared to the results stored in the tutorial. This could lead to a possible cause of the file hcc1143_T_clean.counts.hdf5
provided in the tutorial.
Results:
Following steps 1 to 4, the results have values 0.164 and 0.137. However, the values in the tutorial are 0.134 and 0.125.
-
Hi Yuwei Bao
It is highly likely that files we provided for users to try are different from what we actually used in generating the data and images in the tutorial due restrictions on data availability therefore it is normal that you are observing these differences. The data that you are using yourself in this tutorial is probably the one that our team simulated in-silico years ago. If you get the very same result that you observe with each repetition then you are good to go.
Regards.
-
Hi Gökalp:
Thank you very much for confirming!
Could you recommend resources to understand and explain these CNV copy ratio results? I am new to CNV analysis. Running other pipelines results in the location and number of copies presented in a region, while gatk pipeline leads to ratios and plots. I don't know how to interpret these results jointly.
Best,
Yuwei -
We do not have a joint segmentation workflow for Somatic CNVs however our outputs of .seg files are a tab separated format which you may use easily. They are the final segmentation of copy ratios of which you can check the regions where copy number variations are present. LOG2 ratios are basically linear ratios expressed in LOG2 scale which you may also feed into other segmentation algorithms if desired such as DNAcopy and see your segmentation results from a different angle.
I hope this helps.
Please sign in to leave a comment.
3 comments