Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GATK "At least one interval must remain after intersection" For `FilterIntervals` Command

42 comments

  • David Roazen

    Hi Y R,

    The FilterIntervals tool first takes the intersection of the -L intervals with the annotated intervals, and then intersects the result with the intervals from the TSV provided to -I. A couple of suggestions for things to check:

    1. Could you please confirm that both the interval_list file output by PreprocessIntervals and the TSV output by CollectReadCounts are non-empty?

    2. Assuming they're both non-empty, could you also check the contig names in all three of these inputs:

    -L "${preprocessed_interval_list_dir}/${bam_filename}_targets_preprocessed.interval_list" \
    --annotated-intervals "${annotate_dir}/${bam_filename}_annotated.tsv" \
    -I "${counts_dir}/${bam_filename}_counts.tsv"

    and confirm that the contig names are compatible? If one of them is using a different contig naming scheme, or a different reference, the intersection will be empty.
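A quick way to compare contig names (a sketch; `example.interval_list` below is a toy stand-in for the real inputs) is to pull the first column of the non-`@` lines of each file:

```shell
# Toy interval_list standing in for the real inputs; substitute the real paths.
printf '@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\nchr1\t10000\t20000\t+\t.\n' > example.interval_list

# Contig names are the first column of the non-header (non-@) lines.
# Note: the annotated-intervals and counts TSVs also carry a column-header
# line (e.g. starting with CONTIG) after the @ header, which can be skipped
# with `tail -n +2` before cutting.
grep -v '^@' example.interval_list | cut -f1 | sort -u
```

Running the same extraction on all three inputs and diffing the outputs shows immediately whether one file uses a different naming scheme (e.g. chr1 vs 1).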

    Regards,
    David

  • Y R

    Hello David, they are not empty.

  • SkyWarrior

    Hi Y R

    Is it only a single sample that you provided to the FilterIntervals tool? You need to provide all related samples to this tool at the same time, not separately in a for loop.

  • David Roazen

    Hi Y R,

    In the original version of your post, you said "I am not sure why but on a second pass of my code the Filter Intervals command seems to have worked." Can you please clarify whether it's working for you now, or are you still getting the same error?

    Also, have you checked the contig names in the interval_list and the two TSVs to make sure that they are compatible?

    Regards,

    David

  • Y R

    Yes, I found that it worked with preprocessed intervals fed into FilterIntervals, but I am feeding in only one sample at a time, so I am not sure I am doing this correctly. That is why I was hesitant to respond: I am unsure whether I have solved the issues in the code I submitted above.

  • Hayden Shinn

    I am working with Y R on this script and adjusted it based on our understanding of SkyWarrior's comment about providing the directory containing all batch samples at once rather than one sample at a time via a for loop.

    echo "I entered the BAM file loop"
    echo "Bam directory is: $BAM_DIR"
    echo "Starting Step 1: Preprocessing"

    # Extract the BAM file name without the extension
    bam_filename=$(basename "${bam_file}")
    bam_filename="${bam_filename%.*}"
    echo "My bam file name is ${bam_filename}"

    mkdir -p "${output_dir}/Preprocessed_Interval_Lists"
    preprocessed_interval_list_dir="${output_dir}/Preprocessed_Interval_Lists"

    # Step 1: Preprocessing
    singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
      PreprocessIntervals \
      -R ${reference} \
      -L ${targets_interval_list} \
      --bin-length ${NUM_BINS} \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${preprocessed_interval_list_dir}/"

    echo "Starting Step 2: AnnotateIntervals with GC content"
    mkdir -p "${output_dir}/AnnotateIntervals"
    annotate_dir="${output_dir}/AnnotateIntervals"

    # Step 2: AnnotateIntervals with GC content
    singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
      AnnotateIntervals \
      -L "${preprocessed_interval_list_dir}/" \
      -R ${reference} \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${annotate_dir}/" \
      --mappability-track "${bed_file}"

    echo "Starting Step 3: CollectReadCounts"
    mkdir -p "${output_dir}/CollectReadCounts"
    counts_dir="${output_dir}/CollectReadCounts"

    # Step 3: Collect read counts for the samples
    singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
      CollectReadCounts \
      -L "${preprocessed_interval_list_dir}/" \
      -R ${reference} \
      --interval-merging-rule OVERLAPPING_ONLY \
      -I "${BAM_DIR}/" \
      --format TSV \
      -O "${counts_dir}/"

    # Step 4: FilterIntervals based on GC-content and read counts
    mkdir -p "${output_dir}/Filtered_Intervals"
    filtered_intervals_dir="${output_dir}/Filtered_Intervals"

    singularity exec $singularity_image /gatk/gatk --java-options -Xms8000m \
      FilterIntervals \
      -L "${preprocessed_interval_list_dir}/" \
      --annotated-intervals "${annotate_dir}/" \
      -I "${counts_dir}/" \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${filtered_intervals_dir}/"

    echo "Finished up to FilterIntervals"
    The error below implies that we need to point to a specific input sample/file rather than all files at once:
    A USER ERROR has occurred: Couldn't read file. Error was: {path/to/file} with exception: Cannot read file because it is a directory: file:///path/to/bams/

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
    Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
    Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms8000m -jar /gatk/gatk-package-4.4.0.0-local.jar CollectReadCounts -L path/Preprocessed_Interval_Lists/ -R /path/Reference/Homo_sapiens/reference_fasta --interval-merging-rule OVERLAPPING_ONLY -I /path/to/bams/ --format TSV -O /path/CollectReadCounts/
    20:47:19.975 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    20:47:20.019 INFO  FilterIntervals - ------------------------------------------------------------
    20:47:20.024 INFO  FilterIntervals - The Genome Analysis Toolkit (GATK) v4.4.0.0
    20:47:20.024 INFO  FilterIntervals - For support and documentation go to https://software.broadinstitute.org/gatk/
    20:47:20.025 INFO  FilterIntervals - Executing as user@node41 on Linux v4.19.0-20-amd64 amd64
    20:47:20.026 INFO  FilterIntervals - Java runtime: OpenJDK 64-Bit Server VM v17.0.6+10-Ubuntu-0ubuntu118.04.1
    20:47:20.027 INFO  FilterIntervals - Start Date/Time: June 8, 2023 at 8:47:19 PM GMT
    20:47:20.027 INFO  FilterIntervals - ------------------------------------------------------------
    20:47:20.028 INFO  FilterIntervals - ------------------------------------------------------------
    20:47:20.029 INFO  FilterIntervals - HTSJDK Version: 3.0.5
    20:47:20.029 INFO  FilterIntervals - Picard Version: 3.0.0
    20:47:20.030 INFO  FilterIntervals - Built for Spark Version: 3.3.1
    20:47:20.031 INFO  FilterIntervals - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    20:47:20.031 INFO  FilterIntervals - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    20:47:20.032 INFO  FilterIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    20:47:20.032 INFO  FilterIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    20:47:20.033 INFO  FilterIntervals - Deflater: IntelDeflater
    20:47:20.034 INFO  FilterIntervals - Inflater: IntelInflater
    20:47:20.034 INFO  FilterIntervals - GCS max retries/reopens: 20
    20:47:20.035 INFO  FilterIntervals - Requester pays: disabled
    20:47:20.036 INFO  FilterIntervals - Initializing engine
    20:47:20.037 INFO  FilterIntervals - Done initializing engine
    20:47:20.041 INFO  FilterIntervals - Shutting down engine
    [June 8, 2023 at 8:47:20 PM GMT] org.broadinstitute.hellbender.tools.copynumber.FilterIntervals done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=8388608000
    ***********************************************************************
     
  • SkyWarrior

    Hi Hayden Shinn

    You need to feed all read-count files to the gatk FilterIntervals tool with separate -I switches, or inside an arguments file.

    You may use a loop to string-concatenate your read-count files with the -I prefix, or you may prepare an arguments file like this and feed it to gatk using the --arguments-file parameter.

    -I readcount1.tsv
    -I readcount2.tsv
    -I readcount3.tsv
    ...
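As a concrete sketch of the arguments-file approach (the `counts/` directory and file names below are made up for illustration):

```shell
# Toy read-count files just for illustration; in practice these come from CollectReadCounts.
mkdir -p counts
touch counts/sample1_counts.tsv counts/sample2_counts.tsv

# Write one "-I <file>" line per read-count TSV into an arguments file.
printf -- '-I %s\n' counts/*.tsv > filter_args.txt

cat filter_args.txt
# The file would then be passed to the tool, along the lines of:
#   gatk FilterIntervals --arguments-file filter_args.txt -L ... --annotated-intervals ... -O ...
```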
  • Y R

    The error occurred any time we fed in a directory. Since we are dealing with samples done in groups/batches, we do not know how many samples will be fed in at a time. If I read your previous posting correctly, SkyWarrior, we have to provide the entire batch each time, but we cannot do this one BAM at a time?

  • SkyWarrior

    For everything from FilterIntervals through GermlineCNVCaller (including the end tasks), the read-count files must all be fed at the same time using the -I switch. That means if you have 200 samples, you need to use the -I switch 200 times on your command line, or you may generate an input parameter file as I indicated, containing all 200 read-count files with an -I prefix per line.

    I used a simple perl script to collect all read-count files in an array and add them to the command line automatically. You may try something similar.

  • Y R

    That is an interesting way to feed them in. I was wondering how you added them to the command line automatically?

  • SkyWarrior

    Pretty rough but it does the job fine. 

    #!/usr/bin/perl
    use warnings;
    use strict;
    use POSIX;

    ##Collect all hdf5 readcounts into an array
    my @filelist = glob("*.hdf5");
    ##

    (my $name, my $pass, my $uid, my $gid, my $quota, my $comment, my $gcos, my $dir, my $shell, my $expire) = getpwnam(getlogin());
    my $intervalfile = $ARGV[4];
    my $annovatedfile = $ARGV[5] or die "Missing interval and annotation files";
    my $cpucount = $ARGV[6] or die "choose the number of cpus to run";
    my $gatkver = $ARGV[7] || "4.4.0.0";
    my $samplecount = scalar(@filelist);

    my $counter = 0;

    my $dockercommand = "docker run --rm -t -u ".$uid.":".$gid." --cpus ".$cpucount." -v \$PWD:/home/\${PWD##*/} -w /home/\${PWD##*/} -e THEANO_FLAGS=\"base_compiledir=/home/\${PWD##*/}/tmp\" broadinstitute/gatk:".$gatkver." ";
    my $FilterIntCommand= $dockercommand."gatk FilterIntervals --tmp-dir tmp -L ".$intervalfile." --annotated-intervals ".$annovatedfile." -imr OVERLAPPING_ONLY -O filtered.interval_list ";
    my $DCPcommand = $dockercommand."gatk DetermineGermlineContigPloidy --tmp-dir tmp --contig-ploidy-priors contig_priors.tsv --output DCP --output-prefix CP_CNV --verbosity DEBUG -imr OVERLAPPING_ONLY -L filtered.interval_list ";
    my $CNVcallcommand = $dockercommand."gatk GermlineCNVCaller --tmp-dir tmp --run-mode COHORT --contig-ploidy-calls ./DCP/CP_CNV-calls --output GCNV --output-prefix CNV --annotated-intervals ".$annovatedfile." --verbosity DEBUG -L filtered.interval_list -imr OVERLAPPING_ONLY --max-training-epochs 100 --initial-temperature 2.0 --convergence-snr-trigger-threshold 0.2 ";
    my $postprocesscommand = $dockercommand."gatk PostprocessGermlineCNVCalls --tmp-dir tmp --calls-shard-path GCNV/CNV-calls --model-shard-path GCNV/CNV-model --autosomal-ref-copy-number 2 --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls DCP/CP_CNV-calls -imr OVERLAPPING_ONLY -verbosity DEBUG ";

    ## Add each hdf5 file to the end of the command-line string with the -I prefix
    for (@filelist)
    {
        $FilterIntCommand .= "-I ".$_." ";
        $DCPcommand .= "-I ".$_." ";
        $CNVcallcommand .= "-I ".$_." ";
    }

    system("mkdir -p tmp");

    if ($ARGV[0] == 1)
    {
        system($FilterIntCommand);
    }

    if ($ARGV[1] == 2)
    {
        system($DCPcommand);
    }

    if ($ARGV[2] == 3)
    {
        system($CNVcallcommand);
    }

    if ($ARGV[3] >= 4 && $ARGV[3] <= 5)
    {
        ## For each sample, run postprocess separately and generate segments.vcf files.
        while ($counter < $samplecount)
        {
            my $postprocesscommandfinal = $postprocesscommand."--sample-index ".$counter." --output-genotyped-intervals ".$filelist[$counter]."_interval.vcf.gz --output-genotyped-segments ".$filelist[$counter]."_segments.vcf.gz --output-denoised-copy-ratios ".$filelist[$counter]."denoised-copy-ratios.txt";
            system($postprocesscommandfinal);
            $counter++;
        }
    }
  • Y R

    Interesting, so you feed these in between the beginning of the command and the end. Thank you.

  • Hayden Shinn

    Hi there, we are still working on this pipeline but seem to be hitting an error around the `PostprocessGermlineCNVCalls` step: we aren't getting an output for it, though we are getting outputs for every previous step. A few questions about `PostprocessGermlineCNVCalls`:

    • What intervals should be passed in for the `-L` interval parameter?
    • Do these paths look correct, or is anything noticeably wrong with the code? SkyWarrior

    # Function to add the "-I" flag at the beginning of each line of the array list
    add_I_flag() {
        local input_array=("$@")
        printf -- "-I %s\n" "${input_array[@]}"
    }

    # Function to add the "-L" flag at the beginning of each line of the array list
    add_L_flag() {
        local input_array=("$@")
        printf -- "-L %s\n" "${input_array[@]}"
    }

    # BAM files
    bam_files=("${BAM_DIR}"/*.bam)
    bam_filenames=($(basename "${bam_files[@]}" .bam))

    #_______________________________________________________________________________

    # Step 1: Preprocessing
    mkdir -p "${output_dir}/Preprocessed_Interval_Lists"
    preprocessed_interval_list_dir="${output_dir}/Preprocessed_Interval_Lists"

    output1=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
      PreprocessIntervals \
      -R "${reference}" \
      -L "${targets_interval_list}" \
      --bin-length "${NUM_BINS}" \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list")

    echo "$output1"
    preprocessed_output+=("$output1")

    # Step 2: AnnotateIntervals with GC content
    mkdir -p "${output_dir}/AnnotateIntervals"
    annotate_dir="${output_dir}/AnnotateIntervals"

    output2=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
      AnnotateIntervals \
      -L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
      -R "${reference}" \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${annotate_dir}/annotated_intervals.interval_list" \
      --mappability-track "${bed_file}")

    echo "$output2"
    annotated_output+=("$output2")

    # Step 3: CollectReadCounts
    mkdir -p "${output_dir}/CollectReadCounts"
    counts_dir="${output_dir}/CollectReadCounts"

    for bam_file in "${bam_files[@]}"; do
        bam_filename=$(basename "$bam_file")
        bam_filename="${bam_filename%.*}"
        echo "${bam_filename} being run"

        output3=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
          CollectReadCounts \
          -I "${bam_file}" \
          -L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
          -R "${reference}" \
          --interval-merging-rule OVERLAPPING_ONLY \
          -O "${counts_dir}/${bam_filename}_counts.tsv")

        echo "$output3"
        counts_output+=("$output3")
    done

    # Step 4: FilterIntervals based on GC-content and read counts
    mkdir -p "${output_dir}/Filtered_Intervals"
    filtered_intervals_dir="${output_dir}/Filtered_Intervals"

    # Concatenate all the counts files with the -I flag
    counts_arguments=$(printf -- "-I %s " "${counts_dir}"/*.tsv)

    output4=$(singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
      FilterIntervals \
      ${counts_arguments} \
      -L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
      --annotated-intervals "${annotate_dir}/annotated_intervals.interval_list" \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${filtered_intervals_dir}/filtered_intervals.interval_list")

    filtered_output+=("$output4")
    echo "Finished up to FilterIntervals"

    # Step 5: DetermineGermlineContigPloidy
    mkdir -p "${output_dir}/DetermineGermlineContigPloidy"
    DetermineGermlineContigPloidy_dir="${output_dir}/DetermineGermlineContigPloidy"

    singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
      DetermineGermlineContigPloidy \
      ${counts_arguments} \
      --contig-ploidy-priors ${contig_ploidy_priors} \
      --interval-merging-rule OVERLAPPING_ONLY \
      --output-prefix ploidy \
      --output ${DetermineGermlineContigPloidy_dir}

    # Step 6: SplitIntervals
    echo "Creating shards via bins"
    mkdir -p ${output_dir}/Split_Intervals
    SplitIntervals_dir=${output_dir}/Split_Intervals

    # run SplitIntervals with shard name FIX
    singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
      SplitIntervals \
      -R ${reference} \
      -L "${preprocessed_interval_list_dir}/preprocessed_intervals.interval_list" \
      --scatter-count ${NUM_BINS} \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${output_dir}/Split_Intervals/"

    mkdir -p "${output_dir}/GermlineCNVCaller"
    GermlineCNVCaller_dir="${output_dir}/GermlineCNVCaller"

    # Concatenate all the split interval files with the -L flag
    shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)

    # Step 7: GermlineCNVCaller (Mandatory), attempting without a for loop
    singularity exec $singularity_image /gatk/gatk --java-options -Xms15g \
      GermlineCNVCaller \
      ${shard_arguments} \
      --interval-merging-rule OVERLAPPING_ONLY \
      --contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
      ${counts_arguments} \
      --annotated-intervals "${annotate_dir}/annotated_intervals.interval_list" \
      --output "${GermlineCNVCaller_dir}" \
      --output-prefix germline_cnv_caller_cohort \
      --run-mode COHORT

    # Step 8: PostprocessGermlineCNVCalls (Mandatory) FIX [copied from gatk tool index]
    # Concatenate the calls shard paths to pass in as --calls-shard-path (list of paths to GermlineCNVCaller call directories).
    # Concatenate the model shard paths to pass in as --model-shard-path (list of paths to GermlineCNVCaller model directories).
    mkdir -p "${output_dir}/PostprocessGermlineCNVCalls"
    PostprocessGermlineCNVCalls_dir="${output_dir}/PostprocessGermlineCNVCalls"

    singularity exec $singularity_image /gatk/gatk --java-options -Xms10g \
      PostprocessGermlineCNVCalls \
      --calls-shard-path "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls/interval.tsv" \
      --contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
      --model-shard-path "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model/interval.tsv" \
      --intervals "${annotate_dir}/annotated_intervals.interval_list" \
      --sample-index 0 \
      --autosomal-ref-copy-number 2 \
      --allosomal-contig X \
      --allosomal-contig Y \
      --sequence-dictionary ${sequence_dict} \
      --interval-merging-rule OVERLAPPING_ONLY \
      --create-output-variant-index \
      --output-genotyped-intervals "${PostprocessGermlineCNVCalls_dir}/name_cohort_intervals.clustered.vcf" \
      --output-genotyped-segments "${PostprocessGermlineCNVCalls_dir}/name_cohort_segments.clustered.vcf" \
      --output-denoised-copy-ratios "${PostprocessGermlineCNVCalls_dir}/name_cohort_copy_ratios.clustered.tsv" \
      -R ${reference} \
      --version
    And here is the end of the .out file (proprietary paths omitted) from the PostProcessing step:
    The Genome Analysis Toolkit (GATK) v4.4.0.0
    HTSJDK Version: 3.0.5
    Picard Version: 3.0.0
    Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar
    Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms10g -jar /gatk/gatk-package-4.4.0.0-local.jar \
    PostprocessGermlineCNVCalls \
    --calls-shard-path /path/to/GermlineCNVCaller/germline_cnv_caller_cohort-calls/interval.tsv \
    --contig-ploidy-calls /path/to/DetermineGermlineContigPloidy/ploidy-calls \
    --model-shard-path /path/to/GermlineCNVCaller/germline_cnv_caller_cohort-model/interval.tsv \
    --intervals path/to/AnnotateIntervals/annotated_intervals.interval_list \
    --sample-index 0 \
    --autosomal-ref-copy-number 2 \
    --allosomal-contig X \
    --allosomal-contig Y \
    --sequence-dictionary /path/sequence/dictionary.dict \
    --interval-merging-rule OVERLAPPING_ONLY \
    --create-output-variant-index \
    --output-genotyped-intervals /path/to/PostprocessGermlineCNVCalls/name_cohort_intervals.clustered.vcf
    --output-genotyped-segments /path/to/PostprocessGermlineCNVCalls/name_cohort_segments.clustered.vcf
    --output-denoised-copy-ratios path/to/PostprocessGermlineCNVCalls/name_cohort_copy_ratios.clustered.tsv
    -R /path/to/ref/ref.fa
    --version


  • SkyWarrior

    Hi Hayden Shinn

    Sorry for being late to answer. There is no need for the -L parameter for PostprocessGermlineCNVCalls, as all interval parameters were already consumed by GermlineCNVCaller. For the shards and the model you need to provide the path to the GermlineCNVCaller output+prefix path. If, say, your output folder for GermlineCNVCaller is ./output and your prefix is GCNV, then you need to provide ./output/GCNV-calls as the calls shard path and ./output/GCNV-model as the model shard path. You can also check these folders to see if they were created properly.

    There is no need for a specific file to be added there. PostprocessGermlineCNVCalls will consume all the data in that prefix according to the sample id. So for each sample you need to call PostprocessGermlineCNVCalls again and again, changing only the sample-index. Later on you will have the opportunity to joint-call all segments to generate an all-sample VCF file for the CNV calls.

  • Y R

    So do we need to change the following each time?:

    --sample-index 0 \
  • SkyWarrior

    Exactly

  • Y R

    Could I ask for how you did this?

  • SkyWarrior

    In my perl file I keep two variables: one for the total number of samples, and the other a simple integer counter that starts at 0. I use these to run a loop over each sample during the post-process step.

    my $samplecount = scalar(@filelist);

    my $counter = 0;
    ## For each sample, run postprocess separately and generate segments.vcf files.
    while ($counter < $samplecount)
    {
        my $postprocesscommandfinal = $postprocesscommand."--sample-index ".$counter." --output-genotyped-intervals ".$filelist[$counter]."_interval.vcf.gz --output-genotyped-segments ".$filelist[$counter]."_segments.vcf.gz --output-denoised-copy-ratios ".$filelist[$counter]."denoised-copy-ratios.txt";
        system($postprocesscommandfinal);
        $counter++;
    }
  • Y R

    Thank you. Is this counter for `--sample-index` based on the shard path, i.e. the directory created by SplitIntervals?

  • SkyWarrior

    --sample-index is the ID that gatk assigns to your sample when the read counts are loaded into DetermineGermlineContigPloidy and GermlineCNVCaller. Below is a sample from a GermlineCNVCaller log:

    16:11:57.578 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes00" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_0"...
    16:12:04.345 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes01" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_1"...
    16:12:11.021 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes02" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_2"...
    16:12:17.437 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes03" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_3"...
    16:12:24.182 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes04" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_4"...
    16:12:30.760 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes05" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_5"...
    16:12:37.199 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes06" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_6"...
    16:12:43.930 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes07" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_7"...
    16:12:50.274 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes08" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_8"...
    16:12:56.635 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes09" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_9"...
    16:13:03.011 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes10" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_10"...
    16:13:09.341 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes11" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_11"...
    16:13:15.669 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes12" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_12"...
    16:13:22.375 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes13" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_13"...
    16:13:28.738 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes14" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_14"...
    16:13:35.755 INFO gcnvkernel.io.io_denoising_calling - Saving posteriors for sample "Wes15" in "/home/GATK_CNV_MGI/GCNV/CNV-calls/SAMPLE_15"...

    Here --sample-index 0 belongs to the Wes00 sample, since I loaded the Wes00.hdf5 file first. Each sample must be separately processed by the post-process algorithm to generate the corresponding intervals, segments, and denoised-ratios files.

    Shards of the processed intervals are different: each shard contains all samples, processed with the same --sample-index IDs.
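One way to drive that per-sample loop in the shell is to derive the sample count from the SAMPLE_N call directories shown above (a sketch; the `GCNV/CNV-calls` layout is mocked up here, and the actual gatk invocation is left as a comment):

```shell
# Mock up the calls layout from the log above (in practice this is GermlineCNVCaller output).
mkdir -p GCNV/CNV-calls/SAMPLE_0 GCNV/CNV-calls/SAMPLE_1 GCNV/CNV-calls/SAMPLE_2

# One SAMPLE_N directory per sample, so counting them gives the number of samples.
n_samples=$(ls -d GCNV/CNV-calls/SAMPLE_* | wc -l)

# Run the post-processing once per sample index, 0 .. n_samples-1.
i=0
while [ "$i" -lt "$n_samples" ]; do
    echo "would post-process sample-index $i"
    # gatk PostprocessGermlineCNVCalls --sample-index "$i" ... (one run per sample)
    i=$((i + 1))
done
```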

  • Y R

    Hello, I am running the entire script but I am not getting any output. What am I doing incorrectly? Is the `--sample-index` in `PostprocessGermlineCNVCalls` based on the shards? Here is the relevant section:

    #_______________________________________________________________________________

    # Create an array of the per-sample call directories and count them, to identify/index the number of samples in the call.
    num_shards=("${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*/*)
    samplecount=${#num_shards[@]} # Get the count of files
    counter=0

    #_______________________________________________________________________________

    # Concatenate the calls shard paths to pass in as --calls-shard-path (list of paths to GermlineCNVCaller call directories):
    calls_shard_list=$(printf -- "--calls-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*)
    echo "${calls_shard_list}"

    #_______________________________________________________________________________

    # Concatenate the model shard paths to pass in as --model-shard-path (list of paths to GermlineCNVCaller model directories):
    model_shard_list=$(printf -- "--model-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model"/*)
    echo "${model_shard_list}"

    #_______________________________________________________________________________

    # Step 8: PostprocessGermlineCNVCalls (Mandatory)
    echo " "
    echo "Starting Step 8: PostprocessGermlineCNVCalls"
    echo " "
    echo "The number of shards: $samplecount"
    echo " "

    while [ $counter -lt $samplecount ]
    do
        echo "Processing sample $counter..."

        singularity exec $singularity_image /gatk/gatk --java-options -Xms10G \
          PostprocessGermlineCNVCalls \
          ${calls_shard_list} \
          --contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
          ${model_shard_list} \
          --sample-index $counter \
          --autosomal-ref-copy-number 2 \
          --allosomal-contig X \
          --allosomal-contig Y \
          --sequence-dictionary ${sequence_dict} \
          --interval-merging-rule OVERLAPPING_ONLY \
          --create-output-variant-index \
          --output-genotyped-intervals "${PostprocessGermlineCNVCalls_dir}/genotyped_intervals.vcf.gz" \
          --output-genotyped-segments "${PostprocessGermlineCNVCalls_dir}/genotyped_segments.vcf.gz" \
          --output-denoised-copy-ratios "${PostprocessGermlineCNVCalls_dir}/denoised_copy_ratios.tsv" \
          -R ${reference} \
          --version

        ((counter++))
        echo "Sample $counter processed."
    done

    echo " "
    echo "Finished with all the required steps"
    echo " "

    #_______________________________________________________________________________

    I am only attaching the code for a specific section above. My `echo` statements are a check; when I ran the script it ended with no error, but there was nothing in the `PostprocessGermlineCNVCalls` directory.
  • SkyWarrior

    Hi Y R

    --sample-index is not based on shards, but on the order and number of samples processed in the GermlineCNVCaller step. If you have 200 samples, you need to start your --sample-index from 0 and process all samples all the way up to 199. Shards are the parts of your interval_list that you divided into subparts to process your calls. Those shards are then united at the post-processing step, so they act as a single unit.

    0
  • Avatar
    Y R

    I am curious whether I am doing something incorrectly in the following 2 steps, as it generates no error but takes an exceedingly long time (over 12 hours) to run 5 BAM files through. It seems to be doing something, but it stalls at the following step:

    Last Output In The .out file:

    ` Aggregating read-count path/to/.bam.counts.tsv (5 / 5)`

    The code associated:

    echo "Starting Step 6: SplitIntervals - Creating shards"

    # Step 6: SplitIntervals    
    singularity exec $singularity_image /gatk/gatk --java-options -Xms"$memory" \
    SplitIntervals \
      -R ${reference} \
      ${filtered_intervals_list} \
      ${bam_arguments} \
      --scatter-count ${NUM_BINS} \
      --interval-merging-rule OVERLAPPING_ONLY \
      -O "${SplitIntervals_dir}"

    # Concatenate all the split interval files with -L flag
    shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    echo "Starting Step 7: GermlineCNVCaller"

    # Step 7) GermlineCNVCaller (Mandatory) attempting without forloop
    singularity exec $singularity_image /gatk/gatk --java-options -Xms15g \
    GermlineCNVCaller \
      ${shard_arguments} \
      --interval-merging-rule OVERLAPPING_ONLY \
      --contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
      ${counts_arguments} \
      ${annotated_intervals_list} \
      --output "${GermlineCNVCaller_dir}" \
      --output-prefix germline_cnv_caller_cohort \
      --run-mode COHORT \
      --verbosity DEBUG

     

    Is there something incorrect? It does get past this to `PostprocessGermlineCNVCalls`, but it is quite slow. Is this normal? Is there something wrong with the code as written?

    0
  • Avatar
    SkyWarrior

    My question is: if you will use all shards at the same time in the same GermlineCNVCaller step, why create them at all? A single interval_list would suffice, and it will create a single call shard. If your interval_list is very big, such as a whole-genome interval_list split into 1000 bp intervals or so, then creating shards is a good idea to accelerate the process, but in that case each split interval_list must be used separately in the GermlineCNVCaller step. This will create different GermlineCNVCaller shards, and those shards can then be used all together at the post-processing step.

    I am using this approach when I need to analyze whole genome samples. 
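    A minimal shell sketch of the per-shard pattern described above (hypothetical names throughout; `echo gatk` stands in for the real container invocation, and the two `touch`ed files simulate a scattered interval_list folder):

    ```shell
    # Per-shard pattern: one GermlineCNVCaller run per scattered interval_list,
    # each with its own -L and --output-prefix. The resulting call/model shards
    # are later passed together to PostprocessGermlineCNVCalls.
    gatk_cmd="echo gatk"            # stand-in for "singularity exec ... /gatk/gatk"
    scatter_dir=$(mktemp -d)        # simulated folder of scattered lists
    touch "$scatter_dir/temp_0001.interval_list" "$scatter_dir/temp_0002.interval_list"

    log=""
    for shard in "$scatter_dir"/*.interval_list; do
        # Derive a unique output prefix from the shard file name.
        prefix="CNV_$(basename "$shard" .interval_list)"
        log+="$($gatk_cmd GermlineCNVCaller -L "$shard" --output-prefix "$prefix")"$'\n'
    done
    printf '%s' "$log"
    ```

    The key point is that each shard is a separate GermlineCNVCaller invocation with its own prefix, not a single run fed every -L at once.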

     

    0
  • Avatar
    Y R

    Is it possible to see your example code for what to do for GermlineCNVCaller and shards?

    1
  • Avatar
    SkyWarrior

    I know that this is totally out of scope for GATK and its accessory partners such as Cromwell and WDL, but for the sake of getting an answer I will post my Perl script here with details. You are free to use it as you see fit. This Perl script is designed for whole-genome CNV analysis.

    I put all count files, the preprocessed interval_list file and its annotation file, and the contig_ploidy.tsv file in the same folder, and within that folder I run the following Perl script to completion.

    use warnings;
    use strict;

    #Collect the list of all count files. I use HDF5 format.
    my @filelist = glob("*.hdf5");

    #Preprocessed interval_list and its annotation files are taken as parameters here
    my $intervalfile = $ARGV[4];
    my $annovatedfile = $ARGV[5] or die "Missing interval and annotation files";

    #select the number of cpu threads dedicated for this task to limit containers
    my $cpucount = $ARGV[6] or die "choose the number of cpus to run";

    #Keep the number of samples as an integer count
    my $samplecount = scalar(@filelist);
    my $counter = 0;

    #Core docker command to run all other commands
    my $dockercommand = "docker run --rm -t -u 1000:1000 --cpus ".$cpucount." -v \$PWD:/home/\${PWD##*/} -w /home/\${PWD##*/} -e THEANO_FLAGS=\"base_compiledir=/home/\${PWD##*/}/tmp\" broadinstitute/gatk:4.4.0.0 ";

    #Core FilterIntervals command to prefilter intervals based on all count files.
    my $FilterIntCommand= $dockercommand."gatk FilterIntervals --tmp-dir tmp -L ".$intervalfile." --annotated-intervals ".$annovatedfile." -imr OVERLAPPING_ONLY -O filtered.interval_list ";

    #Core DetermineGermlineContigPloidy command
    my $DCPcommand = $dockercommand."gatk DetermineGermlineContigPloidy --tmp-dir tmp --contig-ploidy-priors contig_priors.tsv --output DCP --output-prefix CP_CNV --verbosity DEBUG -imr OVERLAPPING_ONLY -L filtered.interval_list ";

    #Core GermlineCNVCaller command
    my $CNVcallcommand = $dockercommand."gatk GermlineCNVCaller --tmp-dir tmp --run-mode COHORT --contig-ploidy-calls ./DCP/CP_CNV-calls --output GCNV --annotated-intervals ".$annovatedfile." --verbosity DEBUG -imr OVERLAPPING_ONLY --max-training-epochs 100 ";

    #Core PostprocessGermlineCNVCalls command
    my $postprocesscommand = $dockercommand."gatk PostprocessGermlineCNVCalls --tmp-dir tmp --autosomal-ref-copy-number 2 --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls DCP/CP_CNV-calls -imr OVERLAPPING_ONLY -verbosity DEBUG ";



    #For each sample add -I switch in a for loop. Order of the samples does not change since all come from the same array
    for (@filelist)
    {
        $FilterIntCommand .= "-I ".$_." ";
        $DCPcommand .= "-I ".$_." ";
        $CNVcallcommand .= "-I ".$_." ";
    }

    #Generate tmp directory for the use of Theano
    system("mkdir -p tmp");

    #Run Scatter intervals. If it was run before clear the previous ones and recreate them since it is trivial
    if($ARGV[0] == 1)
    {
        system($FilterIntCommand);
        system("rm -rf scatter");
        system("mkdir -p scatter");
        system("gatk IntervalListTools -I filtered.interval_list --SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_COUNT 20 --OUTPUT scatter");
    }

    #Run DetermineGermlineContigPloidy
    if($ARGV[1] == 2)
    {
        system($DCPcommand);
    }



    #Collect all scatter interval lists in an array as a path
    my @scatterlist = glob("./scatter/temp*/*.interval_list");



    if($ARGV[2] == 3)
    {
    #For each scattered interval_list append a -L switch separately to GermlineCNVCaller and run separately.
    #Each scatter must be run as a separate task for the GCNV step
        for(@scatterlist)
        {
            
            my @scatter = split('/', $_);
            my $scatterfileorder = $scatter[2];
            my $tempCNVcallcommand = $CNVcallcommand."-L ".$_." --output-prefix CNV_".$scatterfileorder;
            system($tempCNVcallcommand);
            system("rm ".$_);
        }
    }


    #Each scattered GCNV task generates its own calls and models as a shard. Collect them as paths in an array
    my @GCNVcallslist = glob("./GCNV/CNV_temp*calls");
    my @GCNVmodellist = glob("./GCNV/CNV_temp*model");

    #Get the number of shards
    my @count = (0..scalar(@GCNVcallslist)-1);

    if($ARGV[3] >= 4 && $ARGV[3] <= 5)
    {
        
        my $temppostprocesscommand = $postprocesscommand;
    #For each shard generated by GCNV step add them with --calls-shard-path and --model-shard-path switches to PostProcessGermlineCNVCalls core command
        for(@count)
        {
            $temppostprocesscommand .=" --calls-shard-path ".$GCNVcallslist[$_];
            $temppostprocesscommand .=" --model-shard-path ".$GCNVmodellist[$_];
        }    
        
        $temppostprocesscommand .=" ";

        #print $temppostprocesscommand."\n";

    #For each sample run the appended PostprocessGermlineCNVCalls command separately with its appropriate --sample-index value
        while ($counter < $samplecount)
        {
            my $temppostprocesscommandfinal = $temppostprocesscommand."--sample-index ".$counter." --output-genotyped-intervals ".$filelist[$counter]."_interval.vcf.gz --output-genotyped-segments ".$filelist[$counter]."_segments.vcf.gz --output-denoised-copy-ratios ".$filelist[$counter]."denoised-copy-ratios.txt";
            system($temppostprocesscommandfinal);
            $counter++;
        }

    }

     

    I hope this helps. 

    0
  • Avatar
    Y R

    Wait, this doesn't include SplitIntervals to create the scatter? My code is below; is this wrong?

     

    # Step 6: SplitIntervals:

    echo " "
    echo "Starting Step 6: SplitIntervals - Creating shards"
    echo " "

    singularity exec $singularity_image /gatk/gatk SplitIntervals \
    -R ${reference} \
    ${filtered_intervals_list} \
    ${bam_arguments} \
    --scatter-count ${NUM_BINS} \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output "${SplitIntervals_dir}"

    # Concatenate all the split interval files with -L flag:
    shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)

     

    That is the section of my script dealing with SplitIntervals. Is this wrong?

     

     

    0
  • Avatar
    SkyWarrior

    Here is the code for the SplitIntervals step. It is in the code already; I am using IntervalListTools for this purpose.

    #Run Scatter intervals. If it was run before clear the previous ones and recreate them since it is trivial
    if($ARGV[0] == 1)
    {
        system($FilterIntCommand);
        system("rm -rf scatter");
        system("mkdir -p scatter");
        system("gatk IntervalListTools -I filtered.interval_list --SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_COUNT 20 --OUTPUT scatter");
    }
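    Since Y R's pipeline is in bash rather than Perl, the same scatter step might look like this in plain shell (a hedged sketch: the flags mirror the Perl above, `echo gatk` is a dry-run stand-in for the real container command, and the working folder is a temporary directory rather than the literal `scatter` folder):

    ```shell
    # Shell rendering of the Perl scatter step above: clear any previous
    # scatter output, recreate the folder, then split filtered.interval_list
    # into 20 parts with IntervalListTools.
    gatk_cmd="echo gatk"        # stand-in for the real gatk/container command
    workdir=$(mktemp -d)        # hypothetical working folder
    rm -rf "$workdir/scatter" && mkdir -p "$workdir/scatter"
    scatter_log=$($gatk_cmd IntervalListTools \
        -I filtered.interval_list \
        --SUBDIVISION_MODE INTERVAL_COUNT \
        --SCATTER_COUNT 20 \
        --OUTPUT "$workdir/scatter")
    printf '%s\n' "$scatter_log"
    ```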
    0
  • Avatar
    Y R

    I get the most peculiar error from the GermlineCNVCaller:

    23:17:28.757 INFO  IntervalArgumentCollection - Processing 15498753 bp from intervals
    23:17:28.761 INFO  GermlineCNVCaller - Reading and validating annotated intervals...
    23:17:30.382 INFO  GermlineCNVCaller - Shutting down engine

    [August 1, 2023 at 11:17:30 PM GMT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.43 minutes.

    Runtime.totalMemory()=1224736768

    java.lang.IllegalArgumentException: Annotated intervals do not contain all specified intervals.
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
    at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateAnnotatedIntervalsSubset(CopyNumberArgumentValidationUtils.java:261)
    at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.resolveIntervals(GermlineCNVCaller.java:406)
    at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:329)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    Using GATK jar /gatk/gatk-package-4.4.0.0-local.jar

    [...paths removed]

    Processing: path/to/Step_3___CollectReadCounts/Sample_Name.bam.counts.tsv with annotated path/to/Outputs_GATKgCNV/Step_2___AnnotateIntervals/Annotate_Intervals.tsv

    23:17:33.617 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

    23:17:33.725 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression10297239473819555856.so

    23:17:33.760 INFO  GermlineCNVCaller - ------------------------------------------------------------

     

    Why wouldn't the Annotate Intervals file contain all the intervals? I have placed the associated code block below:

     

    # BAM_DIR is the variable that contains the bam directory

    output_dir=${BAM_DIR}/Outputs_GATKgCNV
    mkdir ${output_dir}
    echo " "
    echo "The output directory is: $output_dir"
    echo " "

    #____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

    # Making a Preprocessed_Interval_Lists Folder:

    mkdir -p "${output_dir}/Step_1___Preprocessed_Interval_Lists"
    preprocessed_interval_dir="${output_dir}/Step_1___Preprocessed_Interval_Lists"

    # Making a AnnotateIntervals Folder:

    mkdir -p "${output_dir}/Step_2___AnnotateIntervals"
    annotate_dir="${output_dir}/Step_2___AnnotateIntervals"

    # Making a CollectReadCounts Folder:

    mkdir -p "${output_dir}/Step_3___CollectReadCounts"
    counts_dir="${output_dir}/Step_3___CollectReadCounts"

    # Making a Filtered Interval folder:

    mkdir -p "${output_dir}/Step_4___Filtered_Intervals"
    filtered_intervals_dir="${output_dir}/Step_4___Filtered_Intervals"

    # Making a Determine Germline Contig Ploidy Folder:

    mkdir -p "${output_dir}/Step_5___DetermineGermlineContigPloidy"
    DetermineGermlineContigPloidy_dir="${output_dir}/Step_5___DetermineGermlineContigPloidy"

    # Make a Split intervals folder:

    mkdir -p ${output_dir}/Step_6___Split_Intervals
    SplitIntervals_dir="${output_dir}/Step_6___Split_Intervals"

    # Making a GermlineCNVCaller Folder:

    mkdir -p "${output_dir}/Step_7___GermlineCNVCaller"
    GermlineCNVCaller_dir="${output_dir}/Step_7___GermlineCNVCaller"

    # Making a PostprocessGermlineCNVCalls Folder:

    mkdir -p "${output_dir}/Step_8___PostprocessGermlineCNVCalls"
    PostprocessGermlineCNVCalls_dir="${output_dir}/Step_8___PostprocessGermlineCNVCalls"

    echo " "
    echo "All sub-folders created"
    echo " "

    #____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

    # Step 1: Preprocessing

    echo " "
    echo "Starting Step 1: Preprocessing"
    echo " "

    singularity exec $singularity_image /gatk/gatk PreprocessIntervals \
    --reference ${reference} \
    --intervals ${targets_interval_list} \
    --bin-length ${NUM_BINS} \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output "${preprocessed_interval_dir}/PreprocessIntervals.interval_list"

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Step 2: AnnotateIntervals with GC content

    echo " "
    echo "Starting Step 2: AnnotateIntervals with GC content"
    echo " "

    singularity exec $singularity_image /gatk/gatk AnnotateIntervals \
    --intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
    --reference ${reference} \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output "${annotate_dir}/Annotate_Intervals.tsv" \
    --mappability-track "${bed_file}"

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Concatenate all the bam files with -I flag (done):

    echo " "
    echo "Making a -I flag for the bam directory"
    echo " "

    bam_arguments=$(printf -- "-I %s\n" "${BAM_DIR}"/*.bam)

    echo "${bam_arguments}"

    echo " "
    echo "Those were all the bam files with -I "
    echo " "

    # _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Step 3: Collect Read Counts:

    echo " "
    echo "Starting Step 3: CollectReadCounts"
    echo " "

    # Create an array to store the BAM file paths:

    bam_files=()

    # Iterate over each BAM file in BAM_DIR and add it to the array:

    for bam_file in "$BAM_DIR"/*.bam; do

    bam_files+=("$bam_file")

    done

    # Count the number of BAM files in the directory:

    num_bams=$(find "$BAM_DIR" -type f -name "*.bam" | wc -l)

    echo " "
    echo "Number of BAM files: $num_bams"
    echo " "

    # Run CollectReadCounts on each BAM file:

    for bam_file in "${bam_files[@]}"; do

    singularity exec $singularity_image /gatk/gatk CollectReadCounts \
    --input "$bam_file" \
    --intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
    --reference "${reference}" \
    --interval-merging-rule OVERLAPPING_ONLY \
    --format TSV \
    --output "${counts_dir}/$(basename "$bam_file").counts.tsv"

    done

    echo " "
    echo "Step 3: CollectReadCounts completed"
    echo " "

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Concatenate all the counts files with -I flag:

    counts_arguments=$(printf -- "-I %s " "${counts_dir}"/*.tsv)

    echo " "
    echo "These are the CollectReadCounts file(s) with the flag -I below:"
    echo " "

    echo "${counts_arguments}"
    echo " "
    echo "This is all the CollectReadCounts file(s) with the flag -I."
    echo " "

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Concatenate the interval list for the --annotated-intervals parameter (should be 1 file, but just in case):

    annotated_intervals_list=$(printf -- "--annotated-intervals %s " "${annotate_dir}"/*.tsv)
    echo "${annotated_intervals_list}"

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Step 4: FilterIntervals based on GC-content and read:

    echo " "
    echo "Starting Step 4: FilterIntervals"
    echo " "

    singularity exec $singularity_image /gatk/gatk FilterIntervals \
    --intervals "${preprocessed_interval_dir}/PreprocessIntervals.interval_list" \
    ${annotated_intervals_list} \
    ${counts_arguments} \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output "${filtered_intervals_dir}/Filter_Intervals.interval_list"

    echo "Finished up to FilterIntervals"

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Step 5: DetermineGermlineContigPloidy:

    echo " "
    echo "Starting Step 5: DetermineGermlineContigPloidy"
    echo " "

    singularity exec $singularity_image /gatk/gatk DetermineGermlineContigPloidy \
    --intervals "${filtered_intervals_dir}/Filter_Intervals.interval_list" \
    --interval-merging-rule OVERLAPPING_ONLY \
    ${counts_arguments} \
    --contig-ploidy-priors ${contig_ploidy_priors} \
    --output-prefix ploidy \
    --output ${DetermineGermlineContigPloidy_dir}

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Step 6: SplitIntervals:

    echo " "
    echo "Starting Step 6: SplitIntervals - Creating shards"
    echo " "

    singularity exec $singularity_image /gatk/gatk SplitIntervals \
    --reference ${reference} \
    ${filtered_intervals_list} \
    ${bam_arguments} \
    --scatter-count ${NUM_BINS} \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output "${SplitIntervals_dir}"

    # Concatenate all the split interval files with -L flag:

    shard_arguments=$(printf -- "-L %s " "${SplitIntervals_dir}"/*.interval_list)

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Create variable `samplecount` to identify/index the number of samples in the call.

    samplecount=${num_bams} # Get the count of samples - 1 bam == 1 sample

    samplecount=$((samplecount - 1)) # Decrement samplecount by 1

    echo " "
    echo "The sample count used for the final step is one less than the number of bams."
    echo "The number of bams: $num_bams"
    echo "The sample count is: $samplecount"
    echo " "

    counter=0

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Concatenate calls shard paths to pass in as --calls-shard-path (list of paths to GermlineCNVCaller calls directories):

    calls_shard_list=$(printf -- "--calls-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-calls"/*)
    echo "${calls_shard_list}"

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Concatenate model shard paths to pass in as --model-shard-path (list of paths to GermlineCNVCaller model directories):

    model_shard_list=$(printf -- "--model-shard-path %s " "${GermlineCNVCaller_dir}/germline_cnv_caller_cohort-model"/*)
    echo "${model_shard_list}"

    #_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    # Concatenate all the counts files in an array

    Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${counts_dir}"/*.tsv)

    # Concatenate all the shard files in an array

    Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller=("${SplitIntervals_dir}"/*.interval_list)

    # Step 7: GermlineCNVCaller

    echo " "
    echo "Starting Step 7: GermlineCNVCaller"
    echo " "

    function run_GermlineCNVCaller() {

    local annotated_intervals_list=$1

    # Loop through each shard file path in Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller

    for shard_file in "${Shard_Files_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do

    for counts_arg in "${Counts_Arguments_To_Be_Fed_Into_Step_7_GermlineCNVCaller[@]}"; do

    echo "Processing: $counts_arg with annotated intervals: ${annotate_dir}/Annotate_Intervals.tsv"

    shard_name=$(basename "$shard_file")

    singularitycommand="singularity exec $singularity_image /gatk/gatk"

    ${singularitycommand} GermlineCNVCaller \
    --contig-ploidy-calls "${DetermineGermlineContigPloidy_dir}/ploidy-calls" \
    -I "$counts_arg" \
    -L "$shard_file" \
    --annotated-intervals "${annotate_dir}/Annotate_Intervals.tsv" \
    --output "${GermlineCNVCaller_dir}" \
    --output-prefix Step_7___GermlineCNVCaller___${counts_arg}_${shard_name}______ \
    --interval-merging-rule OVERLAPPING_ONLY \
    --run-mode COHORT \
    --verbosity DEBUG

    done

    done

    }

    run_GermlineCNVCaller

    # Wait for all background processes to finish

    wait
    Then PostprocessGermlineCNVCalls would be run, but as the error occurs at GermlineCNVCaller I have only included the code up to that step.
    Why is this occurring? How would I begin to fix it? This is using the GATK 4.4.0.0 Singularity image. When I paste code here the spacing seems to be significantly different from the way it appears in my script, but the commands are the same.
    2
  • Avatar
    Hayden Shinn

    SkyWarrior I am running into a similar issue as Y R. Any chance you've seen this?

    2
