GATKgCNV Error: "Some contigs do not have ploidy priors"
The information is listed below:
a) GATK version used: 4.4.0.0 using a singularity image whose path is stored under `singularity_image`.
b) Exact commands used are below. My previous posting dealt with hg38, but this run uses the UCSC hg19 fasta. It takes the Agilent BED file, a `targets_interval_list` made from it together with a sequence dictionary stored under the variable `sequence_dict`, and a BAM directory stored under the variable `BAM_DIR`.
#____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
# Making a Preprocessed_Interval_Lists Folder:
mkdir -p "${output_dir}/Step_1___Preprocessed_Interval_Lists"
preprocessed_interval_dir="${output_dir}/Step_1___Preprocessed_Interval_Lists"
# Making a AnnotateIntervals Folder:
mkdir -p "${output_dir}/Step_2___AnnotateIntervals"
annotate_dir="${output_dir}/Step_2___AnnotateIntervals"
# Making a CollectReadCounts Folder:
mkdir -p "${output_dir}/Step_3___CollectReadCounts"
counts_dir="${output_dir}/Step_3___CollectReadCounts"
# Making a Filtered Interval folder:
mkdir -p "${output_dir}/Step_4___Filtered_Intervals"
filtered_intervals_dir="${output_dir}/Step_4___Filtered_Intervals"
# Making a Determine Germline Contig Ploidy Folder:
mkdir -p "${output_dir}/Step_5___DetermineGermlineContigPloidy"
DetermineGermlineContigPloidy_dir="${output_dir}/Step_5___DetermineGermlineContigPloidy"
# Make a Split intervals folder:
mkdir -p "${output_dir}/Step_6___Split_Intervals"
SplitIntervals_dir="${output_dir}/Step_6___Split_Intervals"
# Making a GermlineCNVCaller Folder:
mkdir -p "${output_dir}/Step_7___GermlineCNVCaller"
GermlineCNVCaller_dir="${output_dir}/Step_7___GermlineCNVCaller"
# Making a PostprocessGermlineCNVCalls Folder:
mkdir -p "${output_dir}/Step_8___PostprocessGermlineCNVCalls"
PostprocessGermlineCNVCalls_dir="${output_dir}/Step_8___PostprocessGermlineCNVCalls"
echo " "
echo "All sub-folders created"
echo " "
echo "____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________"
#____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
# Step 1: Preprocessing
echo " "
echo "Starting Step 1: Preprocessing"
echo " "
singularity exec $singularity_image /gatk/gatk PreprocessIntervals \
--reference ${reference} \
--intervals ${targets_interval_list} \
--bin-length ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
--sequence-dictionary ${sequence_dict} \
--output "${preprocessed_interval_dir}/PreprocessIntervals.interval_list"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 2: AnnotateIntervals with GC content
echo " "
echo "Starting Step 2: AnnotateIntervals with GC content"
echo " "
singularity exec $singularity_image /gatk/gatk AnnotateIntervals \
--intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
--reference ${reference} \
--interval-merging-rule OVERLAPPING_ONLY \
--sequence-dictionary ${sequence_dict} \
--output "${annotate_dir}/Annotate_Intervals.tsv" \
--mappability-track "${bed_file}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the bam files with -I flag (done):
echo " "
echo "Making a -I flag for the bam directory"
echo " "
bam_arguments=$(printf -- "-I %s\n" "${BAM_DIR}"/*.bam)
echo "${bam_arguments}"
echo " "
echo "Those were all the bam files with -I "
echo " "
# _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 3: Collect Read Counts:
echo " "
echo "Starting Step 3: CollectReadCounts"
echo " "
# Create an array to store the BAM file paths:
bam_files=()
# Iterate over each BAM file in BAM_DIR and add it to the array:
for bam_file in "$BAM_DIR"/*.bam; do
bam_files+=("$bam_file")
done
# Count the number of BAM files in the directory:
num_bams=$(find "$BAM_DIR" -type f -name "*.bam" | wc -l)
echo " "
echo "Number of BAM files: $num_bams"
echo " "
# Run CollectReadCounts on each BAM file:
for bam_file in "${bam_files[@]}"; do
singularity exec $singularity_image /gatk/gatk CollectReadCounts \
--input "$bam_file" \
--sequence-dictionary ${sequence_dict} \
--intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
--reference "${reference}" \
--interval-merging-rule OVERLAPPING_ONLY \
--format TSV \
--output "${counts_dir}/$(basename "$bam_file").counts.tsv"
done
echo " "
echo "Step 3: CollectReadCounts completed"
echo " "
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the counts files with -I flag:
counts_arguments=$(printf -- "-I %s " "${counts_dir}"/*.tsv)
echo " "
echo "These are the CollectReadCounts file(s) with the flag -I below:"
echo " "
echo "${counts_arguments}"
echo " "
echo "This is all the CollectReadCounts file(s) with the flag -I."
echo " "
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate the interval list for --annotate-intervals parameter (should be 1 file but just in case):
annotated_intervals_list=$(printf -- "--annotated-intervals %s " "${annotate_dir}"/*.tsv)
echo "${annotated_intervals_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 4: FilterIntervals based on GC content and read counts:
echo " "
echo "Starting Step 4: FilterIntervals"
echo " "
singularity exec $singularity_image /gatk/gatk FilterIntervals \
--intervals ${preprocessed_interval_dir}/PreprocessIntervals.interval_list \
${annotated_intervals_list} \
${counts_arguments} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${filtered_intervals_dir}/Filter_Intervals.interval_list"
echo "Finished up to FilterIntervals"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 5: DetermineGermlineContigPloidy:
echo " "
echo "Starting Step 5: DetermineGermlineContigPloidy"
echo " "
singularity exec $singularity_image /gatk/gatk DetermineGermlineContigPloidy \
--intervals "${filtered_intervals_dir}/Filter_Intervals.interval_list" \
--interval-merging-rule OVERLAPPING_ONLY \
${counts_arguments} \
--contig-ploidy-priors ${contig_ploidy_priors} \
--output-prefix ploidy \
--output ${DetermineGermlineContigPloidy_dir}
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 6: SplitIntervals:
echo " "
echo "Starting Step 6: SplitIntervals - Creating shards"
echo " "
# Adding in Interval Count
singularity exec $singularity_image /gatk/gatk SplitIntervals \
--reference ${reference} \
${filtered_intervals_list} \
${bam_arguments} \
--scatter-count ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
--subdivision-mode INTERVAL_COUNT \
--output "${SplitIntervals_dir}"
# Concatenate all the split interval files with -L flag:
shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Create variable `sample_counter` to identify/index number of samples in the call.
samplecount=${num_bams} # Get the count of samples - 1 bam == 1 sample
samplecount=$((samplecount - 1)) # Decrement samplecount by 1
echo " "
echo "The sample count used for the final step is one less than the number of bams."
echo "The number of bams: $num_bams"
echo "The sample count is: $samplecount"
echo " "
counter=0
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate the call shard paths to pass in as --calls-shard-path (list of paths to GermlineCNVCaller call directories):
calls_shard_list=$(printf -- "--calls-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*)
echo "${calls_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate the model shard paths to pass in as --model-shard-path (list of paths to GermlineCNVCaller model directories):
model_shard_list=$(printf -- "--model-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model"/*)
echo "${model_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the counts files in an array
Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${counts_dir}"/*.tsv)
# Concatenate all the shard files in an array
Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${SplitIntervals_dir}"/*.interval_list)
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 7: GermlineCNVCaller
echo " "
echo "Starting Step 7: GermlineCNVCaller"
echo " "
function run_GermlineCNVCaller() {
local annotated_intervals_list=$1
# Loop through each shard file path in Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller
for shard_file in "${Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
for counts_arg in "${Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
echo "Processing: $counts_arg with annotated intervals: ${annotate_dir}/Annotate_Intervals.tsv"
shard_name=$(basename "$shard_file")
singularitycommand="singularity exec $singularity_image /gatk/gatk"
${singularitycommand} GermlineCNVCaller \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
-I "$counts_arg" \
-L "$shard_file" \
--annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
--output "${GermlineCNVCaller_dir}" \
--output-prefix Step_7___GermlineCNVCaller___${counts_arg}_${shard_name}______ \
--interval-merging-rule OVERLAPPING_ONLY \
--run-mode COHORT \
--verbosity DEBUG
done
done
}
run_GermlineCNVCaller
# Wait for all background processes to finish
wait
c) The error:
Stderr: Traceback (most recent call last):
File "/tmp/cohort_determine_ploidy_and_depth.1548328046736995123.py", line 121, in <module>
sample_metadata_collection)
File "/path/to/python3.6/site-packages/gcnvkernel/models/model_ploidy.py", line 146, in __init__
"Some contigs do not have ploidy priors"
AssertionError: Some contigs do not have ploidy priors
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:112)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:193)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:168)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:139)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:427)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:324)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
What does this mean? The contig ploidy priors table is what was recommended (also pasted below):
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
1 0.01 0.01 0.97 0.01
2 0.01 0.01 0.97 0.01
3 0.01 0.01 0.97 0.01
4 0.01 0.01 0.97 0.01
5 0.01 0.01 0.97 0.01
6 0.01 0.01 0.97 0.01
7 0.01 0.01 0.97 0.01
8 0.01 0.01 0.97 0.01
9 0.01 0.01 0.97 0.01
10 0.01 0.01 0.97 0.01
11 0.01 0.01 0.97 0.01
12 0.01 0.01 0.97 0.01
13 0.01 0.01 0.97 0.01
14 0.01 0.01 0.97 0.01
15 0.01 0.01 0.97 0.01
16 0.01 0.01 0.97 0.01
17 0.01 0.01 0.97 0.01
18 0.01 0.01 0.97 0.01
19 0.01 0.01 0.97 0.01
20 0.01 0.01 0.97 0.01
21 0.01 0.01 0.97 0.01
22 0.01 0.01 0.97 0.01
X 0.01 0.49 0.49 0.01
The contig ploidy priors table never gave an error before. Why is this error occurring?
-
Hi Y R
The contig ploidy file provided for the gCNV workflow should contain all the contigs that are used to collect read counts from the alignment files, and their names should exactly match the contig names in the reference genome. Can you check the contig names present in your interval_list file and your contig ploidy priors file? If they do not match, make them match and try rerunning the tool.
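For example, one quick way to compare the two sets of contig names from the command line (assuming the standard interval_list and priors layouts; file names are placeholders):
# Contigs actually present in the interval list (data lines, first column):
grep -v '^@' targets.interval_list | cut -f1 | sort -u > interval_contigs.txt
# Contigs declared in the ploidy priors table (skip the header row):
tail -n +2 contig_ploidy_priors.tsv | cut -f1 | sort -u > prior_contigs.txt
# Anything printed here is used by the intervals but has no ploidy prior:
comm -23 interval_contigs.txt prior_contigs.txt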
I hope this helps.
-
I think I see part of the issue. The chromosome names are chrM, chr1, etc. in the interval_list file, while the contig ploidy priors table has 1, 2, 3, etc. What would I put in for chrM in the contig ploidy priors table?
-
Hi Y R
Since the mitochondrial chromosome is not similar to autosomal or gonosomal chromosomes in terms of copy number and behavior, it is not recommended to include chrM in the gCNV analysis.
It may be better for you to remove chrM counts from your read counts. You can do this by adding
-XL chrM
to your command line during your analysis.
-
To which command?
-
You may add it to the GermlineCNVCaller step.
-
I am adding it here as the following:
# Step 7: GermlineCNVCaller
echo " "
echo "Starting Step 7: GermlineCNVCaller"
echo " "
function run_GermlineCNVCaller() {
local annotated_intervals_list=$1
# Loop through each shard file path in Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller
for shard_file in "${Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
for counts_arg in "${Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
echo "Processing: $counts_arg with annotated intervals: ${annotate_dir}/Annotate_Intervals.tsv"
shard_name=$(basename "$shard_file")
singularitycommand="singularity exec $singularity_image /gatk/gatk"
${singularitycommand} GermlineCNVCaller \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
-I "$counts_arg" \
-L "$shard_file" \
--annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
--output "${GermlineCNVCaller_dir}" \
--output-prefix Step_7___GermlineCNVCaller___${counts_arg}_${shard_name}______ \
--interval-merging-rule OVERLAPPING_ONLY \
--run-mode COHORT \
-XL chrM \
--verbosity DEBUG
done
done
}
Question(s):
1) Is there any difference between `-XL` and `--exclude-intervals`?
2) Should I add the `chr` to the contig ploidy priors? Or should it not contain `chr`?
-
Names in the contig ploidy priors should match the actual sequence names in the reference file. If your reference has the chr prefix, then the contig ploidy priors should have chr as well.
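If the priors table currently lacks the prefix, one possible way to add it (assuming a tab-separated table with a header row; file names are placeholders):
# Prepend "chr" to every contig name in the priors table, leaving the header row untouched:
awk 'BEGIN{FS=OFS="\t"} NR==1{print; next} {$1="chr"$1; print}' contig_ploidy_priors.tsv > chr_contig_ploidy_priors.tsv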
-
Then the contig ploidy priors table would not include `chrM` but would be the following?
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr1 0.01 0.01 0.97 0.01
chr2 0.01 0.01 0.97 0.01
chr3 0.01 0.01 0.97 0.01
chr4 0.01 0.01 0.97 0.01
chr5 0.01 0.01 0.97 0.01
chr6 0.01 0.01 0.97 0.01
chr7 0.01 0.01 0.97 0.01
chr8 0.01 0.01 0.97 0.01
chr9 0.01 0.01 0.97 0.01
chr10 0.01 0.01 0.97 0.01
chr11 0.01 0.01 0.97 0.01
chr12 0.01 0.01 0.97 0.01
chr13 0.01 0.01 0.97 0.01
chr14 0.01 0.01 0.97 0.01
chr15 0.01 0.01 0.97 0.01
chr16 0.01 0.01 0.97 0.01
chr17 0.01 0.01 0.97 0.01
chr18 0.01 0.01 0.97 0.01
chr19 0.01 0.01 0.97 0.01
chr20 0.01 0.01 0.97 0.01
chr21 0.01 0.01 0.97 0.01
chr22 0.01 0.01 0.97 0.01
chrX 0.01 0.49 0.49 0.01
chrY 0.50 0.50 0.00 0.00
Question(s):
1) This is what I believe I understand from what you mentioned; is this correct?
2) Is there any difference between `-XL` and `--exclude-intervals`?
3) Now I get the error:
A USER ERROR has occurred: Argument -L,-XL has a bad value: path/to/0000-scattered.interval_list], [chrM]. The intervals specified for exclusion with -XL removed all territory specified by -L.
Why is this happening?
-
Hi Y R
Looks like some of your scatter files have intervals from chrM, and -XL chrM removes all of those intervals. So my first suggestion may not be a complete solution; you need to remove all chrM intervals from your initial interval_list file and from your collected read-count files as well. This means you need to recollect your read counts with an interval_list file that does not have intervals from chrM.
-
I removed the `chrM` from the bed file by using the following command:
awk '$1 != "chrM"' input.bed > output.bed
then ran `BedToIntervalList` to make a fresh interval list so that the targets interval list would not have chrM. Then I added the new paths for the interval list and the bed file to the GATKgCNV script, but I get this error:
A USER ERROR has occurred: Argument -L,-XL has a bad value: [path/to/Step_6___Split_Intervals/0000-scattered.interval_list], [chrM]. The intervals specified for exclusion with -XL removed all territory specified by -L.
Why is this occurring?
-
Hi Y R
Looks like there is still chrM somewhere in the files. You may need to refresh all count files based on the new interval list without chrM. Then you won't need to use -XL chrM as a parameter anymore.
-
When I remove the `-XL chrM` from the script I get this error (after a fresh run):
14:35:59.472 INFO GermlineCNVCaller - Shutting down engine
[September 14, 2023 at 2:35:59 PM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 1.03 minutes.
Runtime.totalMemory()=1224736768
java.lang.IllegalArgumentException: Annotated intervals do not contain all specified intervals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateAnnotatedIntervalsSubset(CopyNumberArgumentValidationUtils.java:261)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.resolveIntervals(GermlineCNVCaller.java:406)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:329)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
I am using the UCSC hg19 fasta and the Agilent_Version_6_Regions_Modified_chr_M bed file. Why is this happening?
-
Did you update the annotation file for your new fresh intervals?
The error message is clearly indicating an incompatibility between annotated intervals and shards.
-
The annotation file is created in the GATKgCNV script in the first posting, so yes, it is new. I did do a clean run but I am still getting the error.
-
Hi Y R
Annotated intervals and interval list files should be a 1:1 match, even when the list is divided into shards. In the last topic I remember mentioning this specifically, since divided intervals are not the same as the original undivided ones in the annotated file.
For example, let's say we have an interval list file and its annotation.
Interval_list file
chr1 10000 20000
chr1 20001 30000
Annotated file
chr1 10000 20000 0.9
chr1 20001 30000 0.8
If the interval list file is divided into 2 shards like
shard 1
chr1 10000 20000
shard 2
chr1 20001 30000
these shards will still be compatible with the annotation file, since all intervals are present 1:1 in the annotation file.
If those shards are instead set up like this
shard 1
chr1 10000 15000
chr1 15001 20000
shard 2
chr1 20001 25000
chr1 25001 30000
then these intervals, although they cover the same region, are not compatible with the annotation file. Sharding must be done properly in order to be completely compatible with the original annotation file. Splitting intervals into smaller intervals must be done before the annotation, and sharding must be done only by splitting the whole interval list file based on the number of complete regions, if needed.
Since we don't have access to your resources, we cannot definitively say the reason; we can only act based on the error message thrown.
You may need to run your workflow step by step, and at the end of each step you need to check your inputs and outputs.
We hope this helps.
-
Hello Gökalp Çelik, I tracked back to what you mentioned about removing chrM from the interval list by doing the following:
grep -Ev '^\s*@SQ\s+SN:chrM\s*' input.interval_list > output.interval_list
But when I do this and then run the GATKgCNV I get the error:
A USER ERROR has occurred: Argument -L,-XL has a bad value: path/to/0000-scattered.interval_list], [chrM]. The intervals specified for exclusion with -XL removed all territory specified by -L.
I did a fresh re-run, making a `.idx` and `.interval_list`, and then removed chrM. Why doesn't this solve the issue?
-
Hi Y R
It is hard to tell why, but it looks like there are issues in the way you are processing your interval_list files. Again, since we don't have access to your resources, we cannot definitively tell why this is happening. Preprocessing of interval_lists must be done well before you collect your read counts.
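For example, a filter that drops both the chrM @SQ header line and any chrM interval rows (assuming a tab-separated interval_list; file names are placeholders) could look like:
# Keep every header line except the chrM @SQ entry, and every data row whose contig is not chrM:
awk -F'\t' '/^@/ { if ($2 != "SN:chrM") print; next } $1 != "chrM"' input.interval_list > output.interval_list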
I am posting a sample interval_list preprocessed for the Agilent v6 kit that I regularly use for my samples. I do not modify anything in this file and use this single file to collect all my read counts for this kit and perform CNV analyses.
Once again, let me remind you of the steps so you may be able to debug your code, since we could not get definitive command line parameters, only script files that are hard to evaluate just by looking at them.
Step 1:
Preprocess your interval_list. Remove N-only regions, add padding to extend regions, and divide larger regions into smaller ones if needed. Avoid anything that is not autosomal or gonosomal, such as chrM.
Step 1.1:
Create a contig ploidy file for all the contigs you are interested in. Do not include chrM in the list since it is not really the focus of GCNV workflow.
Step 2:
Annotate your preprocessed interval_list file. This is the only step where you will annotate your file and you cannot change it later on.
Step 3:
Collect read counts from all your files using the interval_list file created in step 1.
Step 4:
Run FilterIntervals using all the readcount files at once and your interval_list and annotation file to generate a filtered interval_list.
Step 5
Run DetermineGermlineContigPloidy using the filtered interval_list from step 4, the contig ploidy file from step 1.1, and all read-count files together in a single step.
Step 5.5 (optional)
Scatter your filtered interval_list if you need to. When scattering avoid splitting single intervals into multiples. Only scatter based on number of segments in each scatter.
Step 6.
Run GermlineCNVCaller using the filtered interval_list from step 4 (or, if you scattered your filtered interval_list, each scattered interval list from step 5.5 separately), the annotation file from step 2, the contig ploidy calls from step 5, and all read-count files together in a single step. If you use scattered interval_lists from step 5.5, each scattered file must be used together with the annotation file from step 2, the contig ploidy calls from step 5, and all read-count files, in as many separate runs as there are scattered interval_lists, with the results in separate folders. These separate folders will be called gCNV shards (see the sketch after this list).
Step 7.
Run the PostprocessGermlineCNVCalls step using the gCNV call folders and gCNV model folders from step 6, the contig ploidy calls from step 5, and a single sample index, which is generated when your read counts were put into calls in steps 5 and 6. Make sure that your samples are all in the same order when they are consumed by steps 5 and 6. The sample index is a single integer number and you may be able to see it in the logs. This step must be run for each sample separately. If you have used scattered interval_lists from step 5.5 in step 6, then you need to add all gCNV model and call folders (shards) to the command line.
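As a rough sketch of the scattered cohort run in step 6 (variable names follow the script earlier in this thread; the *-scattered.interval_list glob and the output prefix are assumptions):
# One GermlineCNVCaller run per scattered interval list; all read-count files are passed together.
read_count_args=$(printf -- "-I %s " "${counts_dir}"/*.tsv)
for shard in "${SplitIntervals_dir}"/*-scattered.interval_list; do
shard_name=$(basename "${shard}" .interval_list)
singularity exec "${singularity_image}" /gatk/gatk GermlineCNVCaller \
--run-mode COHORT \
${read_count_args} \
-L "${shard}" \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
--annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${GermlineCNVCaller_dir}" \
--output-prefix "cohort_${shard_name}"
done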
Here is the link to the repository of my sample interval_list files for agilent v6 kit that I use for my agilent v6 captured samples.
https://github.com/gokalpcelik/intervallists
I hope this helps.
-
I am not sure I fully understand, but I wanted to add the following. I removed chrM from the fasta and created a clean interval list without chrM, and then I got this error from DetermineGermlineContigPloidy:
A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',ploidy}' but no positional argument is defined for this tool.
I am sure I will have issues with the interval list after this, but I am not sure what to do with this error. My contig ploidy priors table is as follows:
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr1 0.01 0.01 0.97 0.01
chr2 0.01 0.01 0.97 0.01
chr3 0.01 0.01 0.97 0.01
chr4 0.01 0.01 0.97 0.01
chr5 0.01 0.01 0.97 0.01
chr6 0.01 0.01 0.97 0.01
chr7 0.01 0.01 0.97 0.01
chr8 0.01 0.01 0.97 0.01
chr9 0.01 0.01 0.97 0.01
chr10 0.01 0.01 0.97 0.01
chr11 0.01 0.01 0.97 0.01
chr12 0.01 0.01 0.97 0.01
chr13 0.01 0.01 0.97 0.01
chr14 0.01 0.01 0.97 0.01
chr15 0.01 0.01 0.97 0.01
chr16 0.01 0.01 0.97 0.01
chr17 0.01 0.01 0.97 0.01
chr18 0.01 0.01 0.97 0.01
chr19 0.01 0.01 0.97 0.01
chr20 0.01 0.01 0.97 0.01
chr21 0.01 0.01 0.97 0.01
chr22 0.01 0.01 0.97 0.01
chrX 0.01 0.49 0.49 0.01
chrY 0.50 0.50 0.00 0.00
Why would this occur? The relevant section of the code is placed below:
# Step 5: DetermineGermlineContigPloidy:
echo " "
echo "Starting Step 5: DetermineGermlineContigPloidy"
echo " "
singularity exec $singularity_image_path /gatk/gatk DetermineGermlineContigPloidy \
--intervals "${filtered_intervals_dir}/Filter_Intervals.interval_list" \
--interval-merging-rule OVERLAPPING_ONLY \
${counts_arguments} \
--contig-ploidy-priors ${contig_ploidy_priors} \
--output-prefix ploidy \
--output ${DetermineGermlineContigPloidy_dir}
I would give the example SplitIntervals files if it were not for this interesting error.
-
This is a script issue, not GATK, actually. Your script is passing a malformed input to a parameter.
By the way, singularity is not supported as an executor for GATK containers, therefore if you are facing any issues related to singularity we cannot solve them.
-
Thank you for the response. It isn't an issue with singularity, so it should be fine. The only reason I use singularity is that it is currently the only way I can use GATK on the HPC I am on. The issues I have are with the inputs/script I use, not singularity.
-
Could I ask whether this ( https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy;tab=objects?prefix=&forceOnObjectsSortingFiltering=false ), with the file called `Homo_sapiens_assembly19_1000genomes_decoy.fasta`, can be used? Why is it called "decoy"? Can it be used instead of the UCSC hg19 fasta I have been using? What about for hg38 ( https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy/v0?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false )? Can `Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta` be used? Why is it called noDecoy?
-
Hi Y R
Sorry for the misunderstanding on my end. I did not mean that your problems are related to singularity but if you also face any issues about singularity then we cannot provide support for it because we only support docker for container runtime.
For the reference genome part, you need to check that the sequence dictionaries match in order to use a particular reference file for your workflows. Certain reference genomes were generated for specific purposes, however our main recommendations are documented in several pages.
Of course the choice is yours, but if you use resource files provided by us, keep in mind that they are based on our recommended reference genomes for GRCh37 and GRCh38. You may confirm this by checking the header sections of the VCF and interval_list files provided in the resource bundle.
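One quick way to compare two sequence dictionaries (the .dict file names below are placeholders) is to diff their @SQ name, length and MD5 fields:
diff <(grep '^@SQ' reference_A.dict | cut -f2-4) <(grep '^@SQ' reference_B.dict | cut -f2-4)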
-
Hello Gökalp Celik,
I used the references recommended by GATK for hg19 and then I got a peculiar error dealing with the contig ploidy priors table:
The following mandatory columns could not be found in "path/to/chr_contig_ploidy_priors.tsv"; cannot continue: {'CONTIG_NAME'}
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:112)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:193)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:168)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:139)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:427)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:324)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
Why is this occurring? My contig ploidy priors table is pasted below:
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr1 0.01 0.01 0.97 0.01
chr2 0.01 0.01 0.97 0.01
chr3 0.01 0.01 0.97 0.01
chr4 0.01 0.01 0.97 0.01
chr5 0.01 0.01 0.97 0.01
chr6 0.01 0.01 0.97 0.01
chr7 0.01 0.01 0.97 0.01
chr8 0.01 0.01 0.97 0.01
chr9 0.01 0.01 0.97 0.01
chr10 0.01 0.01 0.97 0.01
chr11 0.01 0.01 0.97 0.01
chr12 0.01 0.01 0.97 0.01
chr13 0.01 0.01 0.97 0.01
chr14 0.01 0.01 0.97 0.01
chr15 0.01 0.01 0.97 0.01
chr16 0.01 0.01 0.97 0.01
chr17 0.01 0.01 0.97 0.01
chr18 0.01 0.01 0.97 0.01
chr19 0.01 0.01 0.97 0.01
chr20 0.01 0.01 0.97 0.01
chr21 0.01 0.01 0.97 0.01
chr22 0.01 0.01 0.97 0.01
chrX 0.01 0.49 0.49 0.01
chrY 0.50 0.50 0.00 0.00
Is there something wrong with the contig ploidy priors table?
-
The contig ploidy file is supposed to be tab-separated. Some text editors insert spaces instead of tabs, so you need to check whether your whitespace is actually tabs or spaces.
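A couple of ways to check and fix this from the command line (file names are placeholders; cat -A is the GNU coreutils option that shows tabs as ^I):
# Show whitespace explicitly; real tabs appear as "^I" in the output:
cat -A contig_ploidy_priors.tsv | head -3
# If the columns turn out to be space-separated, rebuild the table with single tabs between fields:
awk 'BEGIN{OFS="\t"} {$1=$1; print}' contig_ploidy_priors.tsv > contig_ploidy_priors.fixed.tsv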
-
Regarding the previous suggestion:
Thank you for your suggestion; it fixed the error.
Basic Information And Statement Of Problem:
I am currently test driving the script on 5 test bam files. Now I get the following error at the Germline CNV Caller step:
java.lang.IllegalArgumentException: Annotated intervals do not contain all specified intervals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateAnnotatedIntervalsSubset(CopyNumberArgumentValidationUtils.java:261)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.resolveIntervals(GermlineCNVCaller.java:406)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:329)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
This error seems to be related to the interval issues I was having before, so I will attach the heads of two to three files created by the interval steps in the hope that the headers will reveal something about the interval issues.
Step 1: Preprocessing Interval List (Only one file generated in this step)
@HD VN:1.6
@SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/path/to/GATK/reference/hg19.fa
Step 2: Annotate Intervals (Only one file generated in this step):
@HD VN:1.6
@SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/path/to/gatk/reference/hg19.fa
Step 3: Collect Read Counts
This was not included as these counts are per bam and there are as many `.tsv` outputs as there are input bams.
Step 4: Filter Interval (Only one `.interval_list`)
@HD VN:1.6
@SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/path/to/gatk/reference/hg19.fa
@SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/path/to/gatk/reference/hg19.fa
Step 5: Determine Germline Contig Ploidy (Many Outputs)
As there are many outputs in this step I am not sure adding the header of any of them will be useful.
Step 6: Scattered Interval Lists (92 total interval lists)
For `0000-scattered.interval_list`:
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/path/to/GATK/reference/hg19.fa
For `0001-scattered.interval_list`:
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/path/to/GATK/reference/hg19.fa
For `0002-scattered.interval_list`:
@SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/path/to/GATK/reference/hg19.fa
@SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/path/to/GATK/reference/hg19.fa
There is no output I can see for Step 7 Germline CNV caller due to this error so none was posted in this response.
Note(s):
The paths are there in the real files but due to the HPC rules I have to follow I have removed them in this post.
-
Headers are just sequence dictionaries, so those are not really helpful for your case. You need to check whether every scattered interval (region, to be specific) is present exactly as it is (not as a sub-part of an interval) in the unscattered interval list that was the input to the scattering tool you are using.
Let's say you have an interval like this in the filtered interval list
chr1 100000 120000
but have
chr1 100000 110000
chr1 110001 120000
in your scattered file. Then you will get this error, because none of your read collections actually have the bottom regions as a read-count value.
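One way to check this (paths are placeholders) is to strip the headers and look for scattered rows that are missing from the filtered list:
grep -v '^@' Filter_Intervals.interval_list | sort > filtered_rows.txt
# Anything printed below is a scattered interval that is not present verbatim in the filtered list:
for shard in /path/to/Step_6___Split_Intervals/*.interval_list; do
grep -v '^@' "$shard" | sort | comm -23 - filtered_rows.txt
done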
-
I am not sure I fully understand, but I can do a `wc -l` for each interval list and check the tail ends here:
Step 1: Preprocess Intervals
Word Count:
`wc -l` produces 781866
The tail:
chr21 48081660 48081859 + .
chr21 48081860 48082059 + .
chr21 48082060 48082100 + .
chr21 48083043 48083242 + .
chr21 48083243 48083442 + .
chr21 48083443 48083642 + .
chr21 48083643 48083718 + .
chr21 48083955 48084154 + .
chr21 48084155 48084354 + .
chr21 48084355 48084536 + .
Step 2: Annotate Intervals
Word Count:
`wc -l` produces 781867
The tail:
chr21 48081660 48081859 0.690000 0.700000
chr21 48081860 48082059 0.800000 0.000000
chr21 48082060 48082100 0.707317 0.000000
chr21 48083043 48083242 0.585000 0.000000
chr21 48083243 48083442 0.580000 0.750000
chr21 48083443 48083642 0.515000 0.125000
chr21 48083643 48083718 0.578947 0.000000
chr21 48083955 48084154 0.290000 0.000000
chr21 48084155 48084354 0.415000 0.405000
chr21 48084355 48084536 0.648352 0.000000
Step 3: Collect Read Counts
This has as many counts files as there are bam input files. This does not seem relevant to the current question so the `wc -l` and the tails are not added here.
Step 4: Filter Intervals
Word Count:
`wc -l` produces 43112.
The tail:
chr21 47423385 47423584 + .
chr21 47423585 47423784 + .
chr21 47531900 47532099 + .
chr21 47532100 47532299 + .
chr21 47532300 47532499 + .
chr21 47545816 47546015 + .
chr21 47549176 47549375 + .
chr21 47552016 47552215 + .
chr21 47552216 47552415 + .
chr21 47556770 47556969 + .
Step 5: Determine Germline Contig Ploidy
This has multiple outputs, and the `wc -l` and the tail do not seem to be useful here.
Step 6: Scattered Interval List/Split Intervals
Word Count (Modified):
for file in *; do echo "$(wc -l < "$file") lines in $file"; done
# Gives 95 lines for each list from 0000 to 0092 so 93 lists
# 95 lines * 93 lists = 8835 lines total
Tail Ends:
Tail of 0000:
@SQ SN:chr9_gl000201_random LN:36148 M5:dfb7e7ec60ffdcb85cb359ea28454ee9 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000235 LN:34474 M5:118a25ca210cfbcdfb6c2ebb249f9680 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000239 LN:33824 M5:99795f15702caec4fa1c4e15f8a29c07 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chr21_gl000210_random LN:27682 M5:851106a74238044126131ce2a8e5847c UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000231 LN:27386 M5:ba8882ce3a1efa2080e5d29b956568a4 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000229 LN:19913 M5:d0f40ec87de311d8e715b52e4c7062e1 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrM LN:16571 M5:d2ed829b8a1628d16cbeee88e88e39eb UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000226 LN:15008 M5:1c1b2cd1fccbc0a99b6a447fa24d1504 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chr18_gl000207_random LN:4262 M5:f3814841f1939d3ca19072d9e89f3fd7 UR:file:/path/to/GATK/referencehg19.fa
chr1 1 249250621 + .
Tail of 0001:
@SQ SN:chr9_gl000201_random LN:36148 M5:dfb7e7ec60ffdcb85cb359ea28454ee9 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000235 LN:34474 M5:118a25ca210cfbcdfb6c2ebb249f9680 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000239 LN:33824 M5:99795f15702caec4fa1c4e15f8a29c07 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chr21_gl000210_random LN:27682 M5:851106a74238044126131ce2a8e5847c UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000231 LN:27386 M5:ba8882ce3a1efa2080e5d29b956568a4 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000229 LN:19913 M5:d0f40ec87de311d8e715b52e4c7062e1 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrM LN:16571 M5:d2ed829b8a1628d16cbeee88e88e39eb UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000226 LN:15008 M5:1c1b2cd1fccbc0a99b6a447fa24d1504 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chr18_gl000207_random LN:4262 M5:f3814841f1939d3ca19072d9e89f3fd7 UR:file:/path/to/GATK/referencehg19.fa
chr2 1 243199373 + .
Tail of 0092:
@SQ SN:chr9_gl000201_random LN:36148 M5:dfb7e7ec60ffdcb85cb359ea28454ee9 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000235 LN:34474 M5:118a25ca210cfbcdfb6c2ebb249f9680 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000239 LN:33824 M5:99795f15702caec4fa1c4e15f8a29c07 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chr21_gl000210_random LN:27682 M5:851106a74238044126131ce2a8e5847c UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000231 LN:27386 M5:ba8882ce3a1efa2080e5d29b956568a4 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000229 LN:19913 M5:d0f40ec87de311d8e715b52e4c7062e1 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrM LN:16571 M5:d2ed829b8a1628d16cbeee88e88e39eb UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chrUn_gl000226 LN:15008 M5:1c1b2cd1fccbc0a99b6a447fa24d1504 UR:file:/path/to/GATK/referencehg19.fa
@SQ SN:chr18_gl000207_random LN:4262 M5:f3814841f1939d3ca19072d9e89f3fd7 UR:file:/path/to/GATK/referencehg19.fa
chr18_gl000207_random 1 4262 + .
Step 7: Germline CNV Caller
Has no outputs due to the error received.
Note(s):
The paths have been removed due to the rules I need to follow as I am running this on an HPC. There are real paths in the original files.
-
Looking at your scatter files, it is clear that they are not prepared properly, and the error is really due to your scatter files being incompatible with the filtered interval list.
The only line that is present after the header is the whole chromosome as a single interval
chr1 1 249250621 + .
Once you filter your intervals for GC content, mappability and segmental duplication using FilterIntervals, you need to use that filtered interval list as the input to generate your scattered interval lists, none of which can contain any interval that is not in the actual filtered interval list.
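For reference, a sketch of generating the scatter files directly from the filtered interval list while keeping whole intervals intact (the scatter count is a placeholder; variable names follow the script earlier in this thread):
singularity exec "${singularity_image}" /gatk/gatk SplitIntervals \
--reference ${reference} \
-L "${filtered_intervals_dir}/Filter_Intervals.interval_list" \
--scatter-count 50 \
--subdivision-mode INTERVAL_COUNT \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${SplitIntervals_dir}"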
Regards.
-
Incorporation Of Suggestion:
It worked. Thank you.
Current Error:
5:24:14.722 INFO GermlineCNVCaller - Shutting down engine
[November 15, 2023 at 3:24:14 PM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.25 minutes.
Runtime.totalMemory()=1224736768
java.lang.IllegalArgumentException: At least two samples must be provided in COHORT mode.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.resolveIntervals(GermlineCNVCaller.java:419)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:329)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Relevant Section Of Script:
# Concatenate all the split interval files with -L flag:
shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Create variable `sample_counter` to identify/index number of samples in the call.
samplecount=${num_bams} # Get the count of samples - 1 bam == 1 sample
samplecount=$((samplecount - 1)) # Decrement samplecount by 1
echo " "
echo "The sample count used for the final step is one less than the number of bams."
echo "The number of bams: $num_bams"
echo "The sample count is: $samplecount"
echo " "
counter=0
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate the call shard paths to pass in as --calls-shard-path (list of paths to GermlineCNVCaller call directories):
calls_shard_list=$(printf -- "--calls-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*)
echo "${calls_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate the model shard paths to pass in as --model-shard-path (list of paths to GermlineCNVCaller model directories):
model_shard_list=$(printf -- "--model-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model"/*)
echo "${model_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the counts files in an array
Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${counts_dir}"/*.tsv)
# Concatenate all the shard files in an array
Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${SplitIntervals_dir}"/*.interval_list)
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 7: GermlineCNVCaller
echo " "
echo "Starting Step 7: GermlineCNVCaller"
echo " "
function run_GermlineCNVCaller() {
local annotated_intervals_list=$1
# Loop through each shard file path in Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller
for shard_file in "${Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
for counts_arg in "${Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
echo "Processing: $counts_arg with annotated intervals: ${annotate_dir}/Annotate_Intervals.tsv"
shard_name=$(basename "$shard_file")
singularitycommand="singularity exec $singularity_image /gatk/gatk"
${singularitycommand} GermlineCNVCaller \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
-I "$counts_arg" \
-L "$shard_file" \
--annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
--output "${GermlineCNVCaller_dir}" \
--output-prefix Step_7___GermlineCNVCaller___${counts_arg}_${shard_name}______ \
--interval-merging-rule OVERLAPPING_ONLY \
--run-mode COHORT \
--verbosity DEBUG
done
done
}
run_GermlineCNVCaller
# Wait for all background processes to finish
wait
Why is this occurring when there were 5 bams/samples input at the beginning of the pipeline?
-
Hi Y R
Can you check your log files to see the actual number of samples fed into GermlineCNVCaller step?
Looks like your script is not properly adding all samples at once to the corresponding step.
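As a sketch, build one -I argument per read-count file and pass them all to a single GermlineCNVCaller invocation per shard (paths follow the script above; the shard file name is a placeholder):
count_args=$(printf -- "-I %s " "${counts_dir}"/*.tsv)
echo "Read-count inputs: ${count_args}"
singularity exec "${singularity_image}" /gatk/gatk GermlineCNVCaller \
--run-mode COHORT \
${count_args} \
-L "${SplitIntervals_dir}/0000-scattered.interval_list" \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
--annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${GermlineCNVCaller_dir}" \
--output-prefix cohort_shard_0000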
Also, 5 samples for a cohort is too low. We would suggest having at least 30 samples for a cohort study.
Regards.