GATK "At least one interval must remain after intersection" For `FilterIntervals` Command
I am trying to follow the GATK gCNV protocol. I am getting an error from the `FilterIntervals` command in GATK; it runs inside a for loop. I defined the following:
- `bam_filename` is the name of the bam file
- `counts_dir` is where the `CollectReadCounts` command outputs data
- `annotate_dir` is where the `AnnotateIntervals` outputs
- `reference` points to a reference fasta.
- `targets_interval_list` was made using `BedToIntervalList` from Picard.
```
# Step 1: Preprocessing
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    PreprocessIntervals \
    -R ${reference} \
    -I ${bam_file} \
    -L ${targets_interval_list} \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "${preprocessed_interval_list_dir}/${bam_filename}_targets_preprocessed.interval_list"

echo "Starting Step 2: AnnotateIntervals with GC content"
mkdir -p "${output_dir}/AnnotateIntervals/"
annotate_dir="${output_dir}/AnnotateIntervals/"

# Step 2: AnnotateIntervals with GC content
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    AnnotateIntervals \
    -L ${targets_interval_list} \
    -R ${reference} \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "${annotate_dir}/${bam_filename}_annotated.tsv" \
    --mappability-track ${bed_file}

echo "Starting Step 3: CollectReadCounts"
# mkdir -p "${output_dir}/CollectReadCounts"
counts_dir="${output_dir}/CollectReadCounts"

# Step 3: Collect read counts for the samples
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    CollectReadCounts \
    -L ${targets_interval_list} \
    -R ${reference} \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I ${bam_file} \
    --format TSV \
    -O "${counts_dir}/${bam_filename}_counts.tsv"

mkdir -p "${output_dir}/Filtered_Intervals"
filtered_intervals_dir="${output_dir}/AnnotateIntervals/"

singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    FilterIntervals \
    -L "${preprocessed_interval_list_dir}/${bam_filename}_targets_preprocessed.interval_list" \
    --annotated-intervals "${annotate_dir}/${bam_filename}_annotated.tsv" \
    -I "${counts_dir}/${bam_filename}_counts.tsv" \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "${filtered_intervals_dir}/${bam_filename}_filtered.interval_list"

echo "Finished up to FilterIntervals"
```
Runtime.totalMemory()=8388608000
java.lang.IllegalArgumentException: At least one interval must remain after intersection.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.FilterIntervals.resolveAndValidateIntervals(FilterIntervals.java:368)
at org.broadinstitute.hellbender.tools.copynumber.FilterIntervals.doWork(FilterIntervals.java:286)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
-
Hi Y R,
The FilterIntervals tool is first taking the intersection of the -L intervals with the annotated intervals, and then taking the intersection of that with the intervals from the TSV provided to -I. A couple of suggestions for things to check:
1. Could you please confirm that both the interval_list file output by PreprocessIntervals and the TSV output by CollectReadCounts are non-empty?
2. Assuming they're both non-empty, could you also check the contig names in all three of these inputs:
-L "${preprocessed_interval_list_dir}/${bam_filename}_targets_preprocessed.interval_list" \
--annotated-intervals "${annotate_dir}/${bam_filename}_annotated.tsv" \
-I "${counts_dir}/${bam_filename}_counts.tsv"
and confirm that the contig names are compatible? If one of them is using a different contig naming scheme, or a different reference, the intersection will be empty.
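One way to compare the contig names across those inputs, sketched as a rough check that assumes SAM-style '@'-prefixed header lines in all three files (the file names below are toy placeholders):

```shell
# Rough sanity check for contig-name compatibility. Assumes header lines are
# '@'-prefixed (SAM-style); toy inputs are created here for illustration --
# point the function at the real interval_list/TSV files instead.
set -eu

contigs() {
  # First column of every non-header line, deduplicated
  grep -v '^@' "$1" | cut -f1 | sort -u
}

# Toy inputs with mismatched contig naming schemes ("chr1" vs "1")
printf '@HD\tVN:1.6\nchr1\t1\t100\t+\ttarget1\n' > toy.interval_list
printf '@HD\tVN:1.6\n1\t1\t100\t42\n' > toy_counts.tsv

contigs toy.interval_list > contigs_a.txt
contigs toy_counts.tsv > contigs_b.txt

# Any diff output means the naming schemes disagree
diff contigs_a.txt contigs_b.txt || true
```

If the listings differ (for example `chr1` vs `1`), the intersection will be empty, which produces exactly this error.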
Regards,
David -
Hello David, they are not empty.
-
Hi Y R
Is only a single sample being provided to the FilterIntervals tool? You need to provide all related samples to this tool at the same time, not one at a time in a for loop.
-
Hi Y R,
In the original version of your post, you said "I am not sure why but on a second pass of my code the Filter Intervals command seems to have worked." Can you please clarify whether it's working for you now, or are you still getting the same error?
Also, have you checked the contig names in the interval_list and the two TSVs to make sure that they are compatible?
Regards,
David
-
Yes, I found that it worked with the preprocessed intervals fed into FilterIntervals. However, I am feeding in only one sample at a time, so I am not sure I am doing this correctly, which is why I was hesitant to respond; I am unsure whether I have solved the issues in the code I submitted above.
-
I am working with Y R on this script and adjusted it based on our understanding of SkyWarrior's comment about providing the directory containing all batch samples at once, rather than one sample at a time via a for loop.
```
echo "I entered the BAM file loop"
echo "Bam directory is: $BAM_DIR"
echo "Starting Step 1: Preprocessing"
# Extract the BAM file name without the extension
bam_filename=$(basename "${bam_file}")
bam_filename="${bam_filename%.*}"
echo "My bam file names it ${bam_filename}"
mkdir -p "${output_dir}/Preprocessed_Interval_Lists"
preprocessed_interval_list_dir="${output_dir}/Preprocessed_Interval_Lists"
# Step 1: Preprocessing
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    PreprocessIntervals \
    -R ${reference} \
    -L ${targets_interval_list} \
    --bin-length ${NUM_BINS} \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "${preprocessed_interval_list_dir}/"
echo "Starting Step 2: AnnotateIntervals with GC content"
mkdir -p "${output_dir}/AnnotateIntervals"
annotate_dir="${output_dir}/AnnotateIntervals"
# Step 2: AnnotateIntervals with GC content
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    AnnotateIntervals \
    -L "${preprocessed_interval_list_dir}/" \
    -R ${reference} \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "${annotate_dir}/" \
    --mappability-track "${bed_file}"
echo "Starting Step 3: CollectReadCounts"
mkdir -p "${output_dir}/CollectReadCounts"
counts_dir="${output_dir}/CollectReadCounts"
# Step 3: Collect read counts for the samples
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    CollectReadCounts \
    -L "${preprocessed_interval_list_dir}/" \
    -R ${reference} \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I "${BAM_DIR}/" \
    --format TSV \
    -O "${counts_dir}/"
# Step 4: FilterIntervals based on GC-content and read
mkdir -p "${output_dir}/Filtered_Intervals"
filtered_intervals_dir="${output_dir}/Filtered_Intervals"
singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
    FilterIntervals \
    -L "${preprocessed_interval_list_dir}/" \
    --annotated-intervals "${annotate_dir}/" \
    -I "${counts_dir}/" \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "${filtered_intervals_dir}/"
echo "Finished up to FilterIntervals"
```

The error below implies that we need to point to a specific input sample/file rather than all files at once:

A USER ERROR has occurred: Couldn't read file. Error was: {path/to/file} with exception: Cannot read file because it is a directory: file:///path/to/bams/
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms8000m -jar /gatk/gatk-package-4.4.0.0-local.jar CollectReadCounts -L path/Preprocessed_Interval_Lists/ -R /path/Reference/Homo_sapiens/reference_fasta --interval-merging-rule OVERLAPPING_ONLY -I /path/to/bams/ --format TSV -O /path/CollectReadCounts/
20:47:19.975 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
20:47:20.019 INFO FilterIntervals - ------------------------------------------------------------
20:47:20.024 INFO FilterIntervals - The Genome Analysis Toolkit (GATK) v4.4.0.0
20:47:20.024 INFO FilterIntervals - For support and documentation go to https://software.broadinstitute.org/gatk/
20:47:20.025 INFO FilterIntervals - Executing as user@node41 on Linux v4.19.0-20-amd64 amd64
20:47:20.026 INFO FilterIntervals - Java runtime: OpenJDK 64-Bit Server VM v17.0.6+10-Ubuntu-0ubuntu118.04.1
20:47:20.027 INFO FilterIntervals - Start Date/Time: June 8, 2023 at 8:47:19 PM GMT
20:47:20.027 INFO FilterIntervals - ------------------------------------------------------------
20:47:20.028 INFO FilterIntervals - ------------------------------------------------------------
20:47:20.029 INFO FilterIntervals - HTSJDK Version: 3.0.5
20:47:20.029 INFO FilterIntervals - Picard Version: 3.0.0
20:47:20.030 INFO FilterIntervals - Built for Spark Version: 3.3.1
20:47:20.031 INFO FilterIntervals - HTSJDK Defaults.COMPRESSION_LEVEL : 2
20:47:20.031 INFO FilterIntervals - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
20:47:20.032 INFO FilterIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
20:47:20.032 INFO FilterIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
20:47:20.033 INFO FilterIntervals - Deflater: IntelDeflater
20:47:20.034 INFO FilterIntervals - Inflater: IntelInflater
20:47:20.034 INFO FilterIntervals - GCS max retries/reopens: 20
20:47:20.035 INFO FilterIntervals - Requester pays: disabled
20:47:20.036 INFO FilterIntervals - Initializing engine
20:47:20.037 INFO FilterIntervals - Done initializing engine
20:47:20.041 INFO FilterIntervals - Shutting down engine
[June 8, 2023 at 8:47:20 PM GMT] org.broadinstitute.hellbender.tools.copynumber.FilterIntervals done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=8388608000
*********************************************************************** -
Hi Hayden Shinn
You need to feed all read-count files to the gatk FilterIntervals tool at the same time, each with its own -I switch, or inside an arguments file.
You may use a loop to string-concatenate your read-count files with the -I prefix, or you may prepare an arguments file like the one below and feed it to gatk using the --arguments-file parameter.
-I readcount1.tsv
-I readcount2.tsv
-I readcount3.tsv
... -
The error occurred any time we fed in a directory. Since we are dealing with samples processed in groups/batches, we do not know how many samples will be fed in at a time. If I read your previous posting correctly, SkyWarrior, we have to provide the entire batch each time, but we cannot do this one BAM at a time?
-
Anything between FilterIntervals and GermlineCNVCalls (including the end tasks) must be fed all read-count files at the same time using the -I switch. That means if you have 200 samples, you need to use the -I switch 200 times on your command line, or you may generate an input parameter file, as I indicated, containing all 200 read-count files with the -I prefix, one per line.
I used a simple perl script to collect all read-count files in an array and add them to the command line automatically. You may try something similar.
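For illustration, a rough bash equivalent of that idea (directory and file names hypothetical) that builds both the inline `-I` string and an arguments file:

```shell
# Hypothetical sketch: build the repeated "-I <file>" arguments from a
# directory of read-count TSVs. A temporary directory with dummy files
# stands in for the real counts directory.
set -eu

counts_dir=$(mktemp -d)
touch "${counts_dir}/sampleA_counts.tsv" "${counts_dir}/sampleB_counts.tsv"

# One "-I <file>" pair per counts file, space-separated on a single line,
# ready to splice into the gatk command line
counts_arguments=$(printf -- "-I %s " "${counts_dir}"/*.tsv)
echo "${counts_arguments}"

# Alternatively, write an arguments file with one "-I <file>" per line,
# to pass via --arguments-file
printf -- "-I %s\n" "${counts_dir}"/*.tsv > filter_intervals.args
```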
-
That is an interesting way to feed it in - I was wondering how you added them to command line automatically?
-
Pretty rough but it does the job fine.
#!/usr/bin/perl
use warnings;
use strict;
use POSIX;
##Collect all hdf5 readcounts into an array
my @filelist = glob("*.hdf5");
##
(my $name, my $pass, my $uid, my $gid, my $quota, my $comment, my $gcos, my $dir, my $shell, my $expire) = getpwnam(getlogin());
my $intervalfile = $ARGV[4];
my $annovatedfile = $ARGV[5] or die "Missing interval and annotation files";
my $cpucount = $ARGV[6] or die "choose the number of cpus to run";
my $gatkver = $ARGV[7] || "4.4.0.0";
my $samplecount = scalar(@filelist);
my $counter = 0;
my $dockercommand = "docker run --rm -t -u ".$uid.":".$gid." --cpus ".$cpucount." -v \$PWD:/home/\${PWD##*/} -w /home/\${PWD##*/} -e THEANO_FLAGS=\"base_compiledir=/home/\${PWD##*/}/tmp\" broadinstitute/gatk:".$gatkver." ";
my $FilterIntCommand= $dockercommand."gatk FilterIntervals --tmp-dir tmp -L ".$intervalfile." --annotated-intervals ".$annovatedfile." -imr OVERLAPPING_ONLY -O filtered.interval_list ";
my $DCPcommand = $dockercommand."gatk DetermineGermlineContigPloidy --tmp-dir tmp --contig-ploidy-priors contig_priors.tsv --output DCP --output-prefix CP_CNV --verbosity DEBUG -imr OVERLAPPING_ONLY -L filtered.interval_list ";
my $CNVcallcommand = $dockercommand."gatk GermlineCNVCaller --tmp-dir tmp --run-mode COHORT --contig-ploidy-calls ./DCP/CP_CNV-calls --output GCNV --output-prefix CNV --annotated-intervals ".$annovatedfile." --verbosity DEBUG -L filtered.interval_list -imr OVERLAPPING_ONLY --max-training-epochs 100 --initial-temperature 2.0 --convergence-snr-trigger-threshold 0.2 ";
my $postprocesscommand = $dockercommand."gatk PostprocessGermlineCNVCalls --tmp-dir tmp --calls-shard-path GCNV/CNV-calls --model-shard-path GCNV/CNV-model --autosomal-ref-copy-number 2 --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls DCP/CP_CNV-calls -imr OVERLAPPING_ONLY -verbosity DEBUG ";
##Add each hdf5 file to the end of the commandline string with -I prefix
for (@filelist)
{
$FilterIntCommand .= "-I ".$_." ";
$DCPcommand .= "-I ".$_." ";
$CNVcallcommand .= "-I ".$_." ";
}
system("mkdir -p tmp");
if($ARGV[0] == 1)
{
system($FilterIntCommand);
}
if($ARGV[1] == 2)
{
system($DCPcommand);
}
if($ARGV[2] == 3)
{
system($CNVcallcommand);
}
if($ARGV[3] >= 4 && $ARGV[3] <= 5)
{
##for each sample run postprocess seperately and generate segments.vcf files.
while ($counter < $samplecount)
{
my$postprocesscommandfinal = $postprocesscommand."--sample-index ".$counter." --output-genotyped-intervals ".$filelist[$counter]."_interval.vcf.gz --output-genotyped-segments ".$filelist[$counter]."_segments.vcf.gz --output-denoised-copy-ratios ".$filelist[$counter]."denoised-copy-ratios.txt";
system($postprocesscommandfinal);
$counter++;
}
}

-
Interesting, so you feed this in between the beginning of the command and the end. Thank you.
-
Hi there, we are still working on this pipeline but seem to be hitting an error around the `PostprocessGermlineCNVCalls` step: we aren't getting an output for it, though we are getting outputs for every previous step. A few questions about `PostprocessGermlineCNVCalls`:
- What intervals should be passed in for the `-L` interval parameter?
- Do these paths look correct, or is anything noticeably wrong with the code? SkyWarrior
And here is the end of the out file (proprietary paths omitted) with the PostprocessGermlineCNVCalls step further below.
# Function to add the "-I" flag at the beginning of each line of the array list
add_I_flag() {
    local input_array=("$@")
    printf -- "-I %s\n" "${input_array[@]}"
}
# Function to add the "-L" flag at the beginning of each line of the array list
add_L_flag() {
    local input_array=("$@")
    printf -- "-L %s\n" "${input_array[@]}"
}
#BAM files
bam_files=("${BAM_DIR}"/*.bam)
bam_filenames=($(basename "${bam_files[@]}" .bam))
#______________________________________________________________________________________________________________________________________________________________
# Step 1: Preprocessing
mkdir -p "${output_dir}/Preprocessed_Interval_Lists"
preprocessed_interval_list_dir="${output_dir}/Preprocessed_Interval_Lists"
output1=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
PreprocessIntervals \
-R "${reference}" \
-L "${targets_interval_list}" \
--bin-length "${NUM_BINS}" \
--interval-merging-rule OVERLAPPING_ONLY \
-O "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list")
echo "$output1"
preprocessed_output+=("$output1")
# Step 2: AnnotateIntervals with GC content
mkdir -p "${output_dir}/AnnotateIntervals"
annotate_dir="${output_dir}/AnnotateIntervals"
output2=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
AnnotateIntervals \
-L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
-R "${reference}" \
--interval-merging-rule OVERLAPPING_ONLY \
-O "${annotate_dir}/annotated_intervals.interval_list" \
--mappability-track "${bed_file}")
echo "$output2"
annotated_output+=("$output2")
# Step 3: CollectReadCounts
mkdir -p "${output_dir}/CollectReadCounts"
counts_dir="${output_dir}/CollectReadCounts"
for bam_file in "${bam_files[@]}"; do
bam_filename=$(basename "$bam_file")
bam_filename="${bam_filename%.*}"
echo "${bam_filename} being run"
output3=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
CollectReadCounts \
-I "${bam_file}" \
-L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
-R "${reference}" \
--interval-merging-rule OVERLAPPING_ONLY \
-O "${counts_dir}/${bam_filename}_counts.tsv")
echo "$output3"
counts_output+=("$output3")
done
# Step 4: FilterIntervals based on GC-content and read
mkdir -p "${output_dir}/Filtered_Intervals"
filtered_intervals_dir="${output_dir}/Filtered_Intervals"
# Concatenate all the counts files with -I flag
counts_arguments=$(printf -- "-I %s " "${counts_dir}"/*.tsv)
output4=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
FilterIntervals \
${counts_arguments} \
-L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
--annotated-intervals "${annotate_dir}/annotated_intervals.interval_list" \
--interval-merging-rule OVERLAPPING_ONLY \
-O "${filtered_intervals_dir}/filtered_intervals.interval_list")
filtered_output+=("$output4")
echo "Finished up to FilterIntervals"
# Step 5: DetermineGermlineContigPloidy
# Making a Determine Germline Contig Ploidy Folder
mkdir -p "${output_dir}/DetermineGermlineContigPloidy"
DetermineGermlineContigPloidy_dir="${output_dir}/DetermineGermlineContigPloidy"
singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
DetermineGermlineContigPloidy \
${counts_arguments} \
--contig-ploidy-priors ${contig_ploidy_priors} \
--interval-merging-rule OVERLAPPING_ONLY \
--output-prefix ploidy \
--output ${DetermineGermlineContigPloidy_dir}
# Step 6: SplitIntervals
echo "Creating shards via bins"
# Make a split intervals folder
mkdir -p ${output_dir}/Split_Intervals
SpllitIntervals_dir=${output_dir}/Split_Intervals
# run SplitIntervals with shard name FIX
singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
SplitIntervals \
-R ${reference} \
-L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
--scatter-count ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
-O "${output_dir}/Split_Intervals/"
mkdir -p "${output_dir}/GermlineCNVCaller"
GermlineCNVCaller_dir="${output_dir}/GermlineCNVCaller"
# Concatenate all the split interval files with -L flag
shard_arguments=$(printf -- "-L %s " "${SpllitIntervals_dir}"/*.interval_list)
# Step 7) GermlineCNVCaller (Mandatory) attempting without forloop
singularity exec $singularity_image /gatk/gatk --java-options -Xms15g \
GermlineCNVCaller \
${shard_arguments} \
--interval-merging-rule OVERLAPPING_ONLY \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
${counts_arguments} \
--annotated-intervals "${annotate_dir}/annotated_intervals.interval_list" \
--output "${GermlineCNVCaller_dir}" \
--output-prefix germline_cnv_caller_cohort \
--run-mode COHORT
# Step 8) Step PostProcessGermlineCNVCalls (Mandatory) FIX [copied from gatk tool index]
#Concatenate calls shard path to pass in as --calls-shard-path parameter, List of paths to GermlineCNVCaller call directories.
#Concatenate models shard path to pass in as --model-shard-path, List of paths to GermlineCNVCaller model directories.
mkdir -p "${output_dir}/PostprocessGermlineCNVCalls"
PostprocessGermlineCNVCalls_dir="${output_dir}/PostprocessGermlineCNVCalls"
singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
PostprocessGermlineCNVCalls \
--calls-shard-path "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls/interval.tsv" \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
--model-shard-path "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model/interval.tsv" \
--intervals "${annotate_dir}/annotated_intervals.interval_list" \
--sample-index 0 \
--autosomal-ref-copy-number 2 \
--allosomal-contig X \
--allosomal-contig Y \
--sequence-dictionary ${sequence_dict} \
--interval-merging-rule OVERLAPPING_ONLY \
--create-output-variant-index \
--output-genotyped-intervals "${PostprocessGermlineCNVCalls_dir}/name_cohort_intervals.clustered.vcf" \
--output-genotyped-segments "${PostprocessGermlineCNVCalls_dir}/name_cohort_segments.clustered.vcf" \
--output-denoised-copy-ratios "${PostprocessGermlineCNVCalls_dir}/name_cohort_copy_ratios.clustered.tsv" \
-R ${reference} \
--version

The Genome Analysis Toolkit (GATK) v4.4.0.0
HTSJDK Version: 3.0.5
Picard Version: 3.0.0
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms10g -jar /gatk/gatk-package-4.4.0.0-local.jar \
PostprocessGermlineCNVCalls \
--calls-shard-path /path/to/GermlineCNVCaller/germline_cnv_caller_cohort-calls/interval.tsv \
--contig-ploidy-calls /path/to/DetermineGermlineContigPloidy/ploidy-calls \
--model-shard-path /path/to/GermlineCNVCaller/germline_cnv_caller_cohort-model/interval.tsv \
--intervals path/to/AnnotateIntervals/annotated_intervals.interval_list \
--sample-index 0 \
--autosomal-ref-copy-number 2 \
--allosomal-contig X \
--allosomal-contig Y \
--sequence-dictionary /path/sequence/dictionary.dict \
--interval-merging-rule OVERLAPPING_ONLY \
--create-output-variant-index \
--output-genotyped-intervals /path/to/PostprocessGermlineCNVCalls/name_cohort_intervals.clustered.vcf
--output-genotyped-segments /path/to/PostprocessGermlineCNVCalls/name_cohort_segments.clustered.vcf
--output-denoised-copy-ratios path/to/PostprocessGermlineCNVCalls/name_cohort_copy_ratios.clustered.tsv
-R /path/to/ref/ref.fa
--version
-
Hi Hayden Shinn
Sorry for being late to answer. There is no need for the -L parameter for PostprocessGermlineCNVCalls, as all interval parameters were already consumed by GermlineCNVCaller. For the shards and model you need to provide the GermlineCNVCaller output path plus prefix. If, say, your output folder for GermlineCNVCaller is ./output and the prefix is GCNV, then you need to provide ./output/GCNV-calls as the calls shard path and ./output/GCNV-model as the model shard path. You can also check these folders to see whether they were created properly.
There is no need for a specific file to be added there. PostprocessGermlineCNVCalls will consume all the data under that prefix according to the sample ID. So for each sample you need to call PostprocessGermlineCNVCalls again, changing only the sample-index. Later on you will have the opportunity to joint-call all segments to generate an all-sample VCF file for CNV calls.
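As a minimal illustration of that path convention (folder and prefix names hypothetical):

```shell
# Hypothetical illustration of the output+prefix convention: with
# --output gcnv_out and --output-prefix GCNV, the post-processing
# shard paths are derived as below.
set -eu

output_dir="gcnv_out"
prefix="GCNV"

calls_shard_path="${output_dir}/${prefix}-calls"
model_shard_path="${output_dir}/${prefix}-model"

echo "--calls-shard-path ${calls_shard_path}"
echo "--model-shard-path ${model_shard_path}"
```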
-
So do we need to change the following each time?:
--sample-index 0 \
-
Exactly
-
Could I ask for how you did this?
-
In my perl file I keep 2 variables: one for the total number of samples, and the other a simple integer that starts at 0. I use these to run a loop over each sample during the post-process step.
my $samplecount = scalar(@filelist);
my $counter = 0;
##for each sample run postprocess separately and generate segments.vcf files.
while ($counter < $samplecount)
{
my $postprocesscommandfinal = $postprocesscommand."--sample-index ".$counter." --output-genotyped-intervals ".$filelist[$counter]."_interval.vcf.gz --output-genotyped-segments ".$filelist[$counter]."_segments.vcf.gz --output-denoised-copy-ratios ".$filelist[$counter]."denoised-copy-ratios.txt";
system($postprocesscommandfinal);
$counter++;
}

-
Thank you. For this counter based on `--sample-index`, is that the shard path to the directory created by SplitIntervals?
-
--sample-index is the ID that gatk assigns to your sample when the read counts are loaded into DetermineGermlineContigPloidy and GermlineCNVCaller. Below is a sample from a GermlineCNVCaller log
16:11:57.578 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes00" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_0"...
16:12:04.345 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes01" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_1"...
16:12:11.021 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes02" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_2"...
16:12:17.437 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes03" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_3"...
16:12:24.182 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes04" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_4"...
16:12:30.760 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes05" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_5"...
16:12:37.199 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes06" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_6"...
16:12:43.930 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes07" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_7"...
16:12:50.274 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes08" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_8"...
16:12:56.635 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes09" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_9"...
16:13:03.011 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes10" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_10"...
16:13:09.341 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes11" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_11"...
16:13:15.669 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes12" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_12"...
16:13:22.375 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes13" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_13"...
16:13:28.738 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes14" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_14"...
16:13:35.755 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes15" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_15"...

Here --sample-index 0 belongs to the Wes00 sample, since I loaded the Wes00.hdf5 file first. Each sample must be separately processed by the post-process step to generate its corresponding intervals, segments, and denoised-ratios files.
Shards of the processed intervals are a different concept: each shard contains all samples, processed with the same --sample-index IDs.
-
Hello, I am running the entire script but I am not getting any output. What am I doing incorrectly? Is the `--sample-index` in `PostprocessGermlineCNVCalls` based on the shards:
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Create a counter to identify/index the number of samples in the call
# Create an array with all entries in the GermlineCNVCaller calls directory
num_shards=("${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*/*)
samplecount=${#num_shards[@]} # Get the count of entries
counter=0
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate calls shard path .tsv files to pass in as --calls-shard-path, List of paths to GermlineCNVCaller model directories.:
calls_shard_list=$(printf -- "--calls-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*)
echo "${calls_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate models shard path .tsv files to pass in as --model-shard-path, List of paths to GermlineCNVCaller model directories.:
model_shard_list=$(printf -- "--model-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model"/*)
echo "${model_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 8: PostProcessGermlineCNVCalls (Mandatory)
echo " "
echo "Starting Step 8: PostProcessGermlineCNVCalls"
echo " "
echo " "
echo " The number of shards: $samplecount"
echo " "
while [ $counter -lt $samplecount ]
do
    echo "Processing sample $counter..."
    singularity exec $singularity_image /gatk/gatk --java-options -Xms10G \
        PostprocessGermlineCNVCalls \
        ${calls_shard_list} \
        --contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
        ${model_shard_list} \
        --sample-index $counter \
        --autosomal-ref-copy-number 2 \
        --allosomal-contig X \
        --allosomal-contig Y \
        --sequence-dictionary ${sequence_dict} \
        --interval-merging-rule OVERLAPPING_ONLY \
        --create-output-variant-index \
        --output-genotyped-intervals "${PostprocessGermlineCNVCalls_dir}/genotyped_intervals.vcf.gz" \
        --output-genotyped-segments "${PostprocessGermlineCNVCalls_dir}/genotyped_segments.vcf.gz" \
        --output-denoised-copy-ratios "${PostprocessGermlineCNVCalls_dir}/denoised_copy_ratios.tsv" \
        -R ${reference} \
        --version
    ((counter++))
    echo "Sample $counter processed."
done
echo " "
echo "Finished with all the required steps"
echo " "
#____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

I am only attaching the code for a specific section above. My `echo` statements are a check; when I ran the script, it ended with no error, but there was nothing in the `PostprocessGermlineCNVCalls` directory.
-
Hi Y R
--sample-index is not based on shards but on the order and number of samples processed in the GermlineCNVCaller step. If you have 200 samples, you need to start your --sample-index from 0 and process all samples all the way up to 199. Shards are the parts of your interval_list that you divided into subparts to process your calls. Those shards are then united at the post-processing step, so they act as a single unit.
-
I am curious whether I am doing something incorrectly at the following 2 steps, as they generate no error but take an exceedingly long time (over 12 hours) to run 5 BAM files through. It seems to be doing something, but it stalls at the following step:
Last Output In The .out file:
` Aggregating read-count path/to/.bam.counts.tsv (5 / 5)`
The code associated:
echo "Starting Step 6: SplitIntervals - Creating shards"
# Step 6: SplitIntervals
singularity exec $singularity_image /gatk/gatk --java-options -Xms"$memory" \
SplitIntervals \
-R ${reference} \
${filtered_intervals_list} \
${bam_arguments} \
--scatter-count ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
-O "${SplitIntervals_dir}"
# Concatenate all the split interval files with -L flag
shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
echo "Starting Step 7: GermlineCNVCaller"
# Step 7) GermlineCNVCaller (Mandatory) attempting without forloop
singularity exec $singularity_image /gatk/gatk --java-options -Xms15g \
GermlineCNVCaller \
${shard_arguments} \
--interval-merging-rule OVERLAPPING_ONLY \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
${counts_arguments} \
${annotated_intervals_list} \
--output "${GermlineCNVCaller_dir}" \
--output-prefix germline_cnv_caller_cohort \
--run-mode COHORT \
--verbosity DEBUG
Is there something incorrect? It does get past this to the `PostprocessGermlineCNVCalls` step, but it is quite slow. Is this normal? Is there something wrong with the code as written?
-
My question is: if you will use all shards at the same time in the same GermlineCNVCaller step, why create them at all? A single interval_list would suffice, and it will create a single call shard. If your interval_list is very big, such as a whole-genome interval_list split into 1000 bp intervals or so, then creating shards is a good idea to accelerate the process, but then each split interval_list must be used separately in the GermlineCNVCaller step. This creates separate GermlineCNVCaller shards, which can then all be used together at the post-processing step.
I am using this approach when I need to analyze whole genome samples.
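As a sketch of that per-shard pattern in shell (a toy directory layout stands in for the IntervalListTools scatter output, and `echo` stands in for the real singularity/gatk call):

```shell
# Build a toy scatter layout like the one IntervalListTools produces.
mkdir -p demo_scatter/temp_0001 demo_scatter/temp_0002
touch demo_scatter/temp_0001/scattered.interval_list \
      demo_scatter/temp_0002/scattered.interval_list
# One GermlineCNVCaller run per scattered interval_list; each run becomes
# its own call/model shard, all combined again at post-processing.
shard_index=0
for shard in demo_scatter/temp_*/scattered.interval_list; do
  echo "gatk GermlineCNVCaller -L $shard --output-prefix CNV_shard_${shard_index} ..."
  shard_index=$((shard_index + 1))
done
```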
-
Is it possible to see your example code for what to do for GermlineCNVCaller and shards?
-
I know that this is totally out of scope for GATK and its accessory partners such as Cromwell and WDL, but for the sake of getting an answer I will post my Perl script here with details. You are free to use it as you see fit. This Perl script is designed for whole-genome CNV analysis.
I put all count files, the preprocessed interval_list file, its annotation file, and the contig_ploidy.tsv file in the same folder, and within that folder I run the following Perl script to completion.
use warnings;
use strict;
#Collect the list of all count files. I use HDF5 format.
my @filelist = glob("*.hdf5");
#Preprocessed interval_list and its annotation files are taken as parameters here
my $intervalfile = $ARGV[4];
my $annovatedfile = $ARGV[5] or die "Missing interval and annotation files";
#select the number of cpu threads dedicated for this task to limit containers
my $cpucount = $ARGV[6] or die "choose the number of cpus to run";
#Keep the number of samples as an integer count
my $samplecount = scalar(@filelist);
my $counter = 0;
#Core docker command to run all other commands
my $dockercommand = "docker run --rm -t -u 1000:1000 --cpus ".$cpucount." -v \$PWD:/home/\${PWD##*/} -w /home/\${PWD##*/} -e THEANO_FLAGS=\"base_compiledir=/home/\${PWD##*/}/tmp\" broadinstitute/gatk:4.4.0.0 ";
#Core FilterIntervals command to prefilter intervals based on all count files.
my $FilterIntCommand= $dockercommand."gatk FilterIntervals --tmp-dir tmp -L ".$intervalfile." --annotated-intervals ".$annovatedfile." -imr OVERLAPPING_ONLY -O filtered.interval_list ";
#Core DetermineGermlineContigPloidy command
my $DCPcommand = $dockercommand."gatk DetermineGermlineContigPloidy --tmp-dir tmp --contig-ploidy-priors contig_priors.tsv --output DCP --output-prefix CP_CNV --verbosity DEBUG -imr OVERLAPPING_ONLY -L filtered.interval_list ";
#Core GermlineCNVCaller command
my $CNVcallcommand = $dockercommand."gatk GermlineCNVCaller --tmp-dir tmp --run-mode COHORT --contig-ploidy-calls ./DCP/CP_CNV-calls --output GCNV --annotated-intervals ".$annovatedfile." --verbosity DEBUG -imr OVERLAPPING_ONLY --max-training-epochs 100 ";
#Core PostprocessGermlineCNVCalls command
my $postprocesscommand = $dockercommand."gatk PostprocessGermlineCNVCalls --tmp-dir tmp --autosomal-ref-copy-number 2 --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls DCP/CP_CNV-calls -imr OVERLAPPING_ONLY -verbosity DEBUG ";
#For each sample add -I switch in a for loop. Order of the samples does not change since all come from the same array
for (@filelist)
{
$FilterIntCommand .= "-I ".$_." ";
$DCPcommand .= "-I ".$_." ";
$CNVcallcommand .= "-I ".$_." ";
}
#Generate tmp directory for the use of Theano
system("mkdir -p tmp");
#Run Scatter intervals. If it was run before clear the previous ones and recreate them since it is trivial
if($ARGV[0] == 1)
{
system($FilterIntCommand);
system("rm -rf scatter");
system("mkdir -p scatter");
system("gatk IntervalListTools -I filtered.interval_list --SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_COUNT 20 --OUTPUT scatter");
}
#Run DetermineGermlineContigPloidy
if($ARGV[1] == 2)
{
system($DCPcommand);
}
#Collect all scatter interval lists in an array as a path
my @scatterlist = glob("./scatter/temp*/*.interval_list");
if($ARGV[2] == 3)
{
#For each scattered interval_list, append the -L switch separately to GermlineCNVCaller and run it separately.
#Each scatter must be run as a separate task for the GCNV step
for(@scatterlist)
{
my @scatter = split('/', $_);
my $scatterfileorder = $scatter[2];
my $tempCNVcallcommand = $CNVcallcommand."-L ".$_." --output-prefix CNV_".$scatterfileorder;
system($tempCNVcallcommand);
system("rm ".$_);
}
}
#Each scattered GCNV task generates its own calls and models as a shard. Collect them as paths in an array
my @GCNVcallslist = glob("./GCNV/CNV_temp*calls");
my @GCNVmodellist = glob("./GCNV/CNV_temp*model");
#Get the number of shards
my @count = (0..scalar(@GCNVcallslist)-1);
if($ARGV[3] >= 4 && $ARGV[3] <= 5)
{
my $temppostprocesscommand = $postprocesscommand;
#For each shard generated by GCNV step add them with --calls-shard-path and --model-shard-path switches to PostProcessGermlineCNVCalls core command
for(@count)
{
$temppostprocesscommand .=" --calls-shard-path ".$GCNVcallslist[$_];
$temppostprocesscommand .=" --model-shard-path ".$GCNVmodellist[$_];
}
$temppostprocesscommand .=" ";
#print $temppostprocesscommand."\n";
#For each sample run the appended PostProcessGermlineCNVCalls command with its appropriate --sample-index value separately
while ($counter < $samplecount)
{
my $temppostprocesscommandfinal = $temppostprocesscommand."--sample-index ".$counter." --output-genotyped-intervals ".$filelist[$counter]."_interval.vcf.gz --output-genotyped-segments ".$filelist[$counter]."_segments.vcf.gz --output-denoised-copy-ratios ".$filelist[$counter]."denoised-copy-ratios.txt";
system($temppostprocesscommandfinal);
$counter++;
}
}
I hope this helps.
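For reference, a hypothetical invocation of the script above (the file name `run_gcnv.pl` is assumed; the four leading arguments are the stage selectors checked against `$ARGV[0]`..`$ARGV[3]`, followed by the preprocessed interval_list, its annotation file, and the CPU limit read from `$ARGV[4]`..`$ARGV[6]`):

```shell
# Hypothetical invocation; stage selectors 1 2 3 4 run FilterIntervals+scatter,
# DetermineGermlineContigPloidy, GermlineCNVCaller, and post-processing in turn.
cmd="perl run_gcnv.pl 1 2 3 4 preprocessed.interval_list annotated.tsv 8"
echo "$cmd"
```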
-
Wait, this doesn't include SplitIntervals to create the scatter? My code is below; is this wrong?
# Step 6: SplitIntervals:
echo " "
echo "Starting Step 6: SplitIntervals - Creating shards"
echo " "
singularity exec $singularity_image /gatk/gatk SplitIntervals \
-R ${reference} \
${filtered_intervals_list} \
${bam_arguments} \
--scatter-count ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${SplitIntervals_dir}"
# Concatenate all the split interval files with -L flag:
shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)
That is the section of my script dealing with SplitIntervals. Is this wrong?
-
Here is the code for the SplitIntervals step; it is actually in the code above. I am using IntervalListTools for this purpose.
#Run Scatter intervals. If it was run before clear the previous ones and recreate them since it is trivial
if($ARGV[0] == 1)
{
system($FilterIntCommand);
system("rm -rf scatter");
system("mkdir -p scatter");
system("gatk IntervalListTools -I filtered.interval_list --SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_COUNT 20 --OUTPUT scatter");
}
-
I get the most peculiar error from the GermlineCNVCaller:
23:17:28.757 INFO IntervalArgumentCollection - Processing 15498753 bp from intervals
23:17:28.761 INFO GermlineCNVCaller - Reading and validating annotated intervals...
23:17:30.382 INFO GermlineCNVCaller - Shutting down engine
[August 1, 2023 at 11:17:30 PM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.43 minutes.
Runtime.totalMemory()=1224736768
java.lang.IllegalArgumentException: Annotated intervals do not contain all specified intervals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateAnnotatedIntervalsSubset(CopyNumberArgumentValidationUtils.java:261)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.resolveIntervals(GermlineCNVCaller.java:406)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:329)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
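That exception fires when the intervals passed with `-L` are not a subset of the intervals in the `--annotated-intervals` file. One way to list which intervals are missing, sketched here with toy file contents (in practice the real files come from SplitIntervals and AnnotateIntervals):

```shell
# Toy stand-ins: an interval_list shard and an AnnotateIntervals-style TSV.
printf '@HD\tVN:1.6\nchr1\t100\t200\nchr1\t300\t400\n' > shard.interval_list
printf 'CONTIG\tSTART\tEND\tGC_CONTENT\nchr1\t100\t200\t0.5\n' > annotated.tsv
# Strip headers, keep CONTIG/START/END, and sort for comm.
grep -v '^@' shard.interval_list | cut -f1-3 | sort > shard_sorted.txt
grep -v -e '^@' -e '^CONTIG' annotated.tsv | cut -f1-3 | sort > annotated_sorted.txt
# comm -23 prints intervals present in the shard but absent from the
# annotation; any output here explains the IllegalArgumentException.
comm -23 shard_sorted.txt annotated_sorted.txt > missing.txt
cat missing.txt
```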
Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
[...paths removed]
Processing: path/to/Step_3___CollectReadCounts/Sample_Name.bam.counts.tsv with annotated path/to/Outputs_GATKgCNV/Step_2___AnnotateIntervals/Annotate_Intervals.tsv
23:17:33.617 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:17:33.725 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression10297239473819555856.so
23:17:33.760 INFO GermlineCNVCaller - ------------------------------------------------------------
Why wouldn't the Annotate Intervals file contain all the intervals? I have placed the associated code block below:
# BAM_DIR is the variable that contains the bam directory
output_dir=${BAM_DIR}/Outputs_GATKgCNV
mkdir ${output_dir}
echo " "
echo "The output directory is: $output_dir"
echo " "
#____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
# Making a Preprocessed_Interval_Lists Folder:
mkdir -p "${output_dir}/Step_1___Preprocessed_Interval_Lists"
preprocessed_interval_dir="${output_dir}/Step_1___Preprocessed_Interval_Lists"
# Making a AnnotateIntervals Folder:
mkdir -p "${output_dir}/Step_2___AnnotateIntervals"
annotate_dir="${output_dir}/Step_2___AnnotateIntervals"
# Making a CollectReadCounts Folder:
mkdir -p "${output_dir}/Step_3___CollectReadCounts"
counts_dir="${output_dir}/Step_3___CollectReadCounts"
# Making a Filtered Interval folder:
mkdir -p "${output_dir}/Step_4___Filtered_Intervals"
filtered_intervals_dir="${output_dir}/Step_4___Filtered_Intervals"
# Making a Determine Germline Contig Ploidy Folder:
mkdir -p "${output_dir}/Step_5___DetermineGermlineContigPloidy"
DetermineGermlineContigPloidy_dir="${output_dir}/Step_5___DetermineGermlineContigPloidy"
# Make a Split intervals folder:
mkdir -p ${output_dir}/Step_6___Split_Intervals
SplitIntervals_dir="${output_dir}/Step_6___Split_Intervals"
# Making a GermlineCNVCaller Folder:
mkdir -p "${output_dir}/Step_7___GermlineCNVCaller"
GermlineCNVCaller_dir="${output_dir}/Step_7___GermlineCNVCaller"
# Making a PostprocessGermlineCNVCalls Folder:
mkdir -p "${output_dir}/Step_8___PostprocessGermlineCNVCalls"
PostprocessGermlineCNVCalls_dir="${output_dir}/Step_8___PostprocessGermlineCNVCalls"
echo " "
echo "All sub-folders created"
echo " "
#____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
# Step 1: Preprocessing
echo " "
echo "Starting Step 1: Preprocessing"
echo " "
singularity exec $singularity_image /gatk/gatk PreprocessIntervals \
--reference ${reference} \
--intervals ${targets_interval_list} \
--bin-length ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${preprocessed_interval_dir}/PreprocessIntervals.interval_list"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 2: AnnotateIntervals with GC content
echo " "
echo "Starting Step 2: AnnotateIntervals with GC content"
echo " "
singularity exec $singularity_image /gatk/gatk AnnotateIntervals \
--intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
--reference ${reference} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${annotate_dir}/Annotate_Intervals.tsv" \
--mappability-track "${bed_file}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the bam files with -I flag (done):
echo " "
echo "Making a -I flag for the bam directory"
echo " "
bam_arguments=$(printf -- "-I %s\n" "${BAM_DIR}"/*.bam)
echo "${bam_arguments}"
echo " "
echo "Those were all the bam files with -I "
echo " "
# _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 3: Collect Read Counts:
echo " "
echo "Starting Step 3: CollectReadCounts"
echo " "
# Create an array to store the BAM file paths:
bam_files=()
# Iterate over each BAM file in BAM_DIR and add it to the array:
for bam_file in "$BAM_DIR"/*.bam; do
bam_files+=("$bam_file")
done
# Count the number of BAM files in the directory:
num_bams=$(find "$BAM_DIR" -type f -name "*.bam" | wc -l)
echo " "
echo "Number of BAM files: $num_bams"
echo " "
# Run CollectReadCounts on each BAM file:
for bam_file in "${bam_files[@]}"; do
singularity exec $singularity_image /gatk/gatk CollectReadCounts \
--input "$bam_file" \
--intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
--reference "${reference}" \
--interval-merging-rule OVERLAPPING_ONLY \
--format TSV \
--output "${counts_dir}/$(basename "$bam_file").counts.tsv"
done
echo " "
echo "Step 3: CollectReadCounts completed"
echo " "
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the counts files with -I flag:
counts_arguments=$(printf -- "-I %s " "${counts_dir}"/*.tsv)
echo " "
echo "These are the CollectReadCounts file(s) with the flag -I below:"
echo " "
echo "${counts_arguments}"
echo " "
echo "This is all the CollectReadCounts file(s) with the flag -I."
echo " "
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate the interval list for --annotate-intervals parameter (should be 1 file but just in case):
annotated_intervals_list=$(printf -- "--annotated-intervals %s " "${annotate_dir}"/*.tsv)
echo "${annotated_intervals_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 4: FilterIntervals based on GC-content and read:
echo " "
echo "Starting Step 4: FilterIntervals"
echo " "
singularity exec $singularity_image /gatk/gatk FilterIntervals \
--intervals ${preprocessed_interval_dir}/PreprocessIntervals.interval_list \
${annotated_intervals_list} \
${counts_arguments} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${filtered_intervals_dir}/Filter_Intervals.interval_list"
echo "Finished up to FilterIntervals"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 5: DetermineGermlineContigPloidy:
echo " "
echo "Starting Step 5: DetermineGermlineContigPloidy"
echo " "
singularity exec $singularity_image /gatk/gatk DetermineGermlineContigPloidy \
--intervals "${filtered_intervals_dir}/Filter_Intervals.interval_list" \
--interval-merging-rule OVERLAPPING_ONLY \
${counts_arguments} \
--contig-ploidy-priors ${contig_ploidy_priors} \
--output-prefix ploidy \
--output ${DetermineGermlineContigPloidy_dir}
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Step 6: SplitIntervals:
echo " "
echo "Starting Step 6: SplitIntervals - Creating shards"
echo " "
singularity exec $singularity_image /gatk/gatk SplitIntervals \
--reference ${reference} \
${filtered_intervals_list} \
${bam_arguments} \
--scatter-count ${NUM_BINS} \
--interval-merging-rule OVERLAPPING_ONLY \
--output "${SplitIntervals_dir}"
# Concatenate all the split interval files with -L flag:
shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)
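As a sanity check on that construction (toy file names used for inspection), the printf pattern expands each split interval_list into its own -L flag:

```shell
# Toy stand-in for the SplitIntervals output directory.
mkdir -p demo_splits
touch demo_splits/0000-scattered.interval_list demo_splits/0001-scattered.interval_list
# printf repeats its format once per argument, yielding one -L per file.
shard_arguments=$(printf -- "-L %s " demo_splits/*.interval_list)
echo "$shard_arguments"
```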
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Create variable `sample_counter` to identify/index number of samples in the call.
samplecount=${num_bams} # Get the count of samples - 1 bam == 1 sample
samplecount=$((samplecount - 1)) # Decrement samplecount by 1
echo " "
echo "The sample count used for the final step is one less than the number of bams."
echo "The number of bams: $num_bams"
echo "The sample count is: $samplecount"
echo " "
counter=0
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate calls shard path .tsv files to pass in as --calls-shard-path, List of paths to GermlineCNVCaller model directories.:
calls_shard_list=$(printf -- "--calls-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*)
echo "${calls_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate models shard path .tsv files to pass in as --model-shard-path, List of paths to GermlineCNVCaller model directories.:
model_shard_list=$(printf -- "--model-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model"/*)
echo "${model_shard_list}"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Concatenate all the counts files in an array
Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${counts_dir}"/*.tsv)
# Concatenate all the shard files in an array
Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${SplitIntervals_dir}"/*.interval_list)
# Step 7: GermlineCNVCaller
echo " "
echo "Starting Step 7: GermlineCNVCaller"
echo " "
function run_GermlineCNVCaller() {
local annotated_intervals_list=$1
# Loop through each shard file path in Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller
for shard_file in "${Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
for counts_arg in "${Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do
echo "Processing: $counts_arg with annotated intervals: ${annotate_dir}/Annotate_Intervals.tsv"
shard_name=$(basename "$shard_file")
singularitycommand="singularity exec $singularity_image /gatk/gatk"
${singularitycommand} GermlineCNVCaller \
--contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
-I "$counts_arg" \
-L "$shard_file" \
--annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
--output "${GermlineCNVCaller_dir}" \
--output-prefix Step_7___GermlineCNVCaller___${counts_arg}_${shard_name}______ \
--interval-merging-rule OVERLAPPING_ONLY \
--run-mode COHORT \
--verbosity DEBUG
done
done
}
run_GermlineCNVCaller
# Wait for all background processes to finish
wait
The PostprocessGermlineCNVCalls step will be run afterwards, but as the error occurs at GermlineCNVCaller I have only included the code up to GermlineCNVCaller. Why is this occurring? How would I begin to fix it? This is using the GATK 4.4.0.0 Singularity image. When I paste code here the spacing appears significantly different from the way it looks in my script, but the commands are the same.
-
SkyWarrior, I am running into a similar issue as Y R. Any chance you've seen this?