Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

(How to) Call rare germline copy number variants


    Enrico Cocchi

    Is there any way to get the log2 output instead of the CN from PostprocessGermlineCNVCalls?

    Calvin Hung

    Hi, I believe the *.tsv files in the tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site are deprecated and cannot run through GermlineCNVCaller since GATK v4.1.x.x. I managed to hack the format and fixed it myself. You might want to update the tutorial files as well.

    Ruqian Lyu


    Thanks for the great tutorial.

    I'm trying to run the pipeline for 300 low-coverage samples (~5X). At the GermlineCNVCaller step, the tool keeps increasing the number of epochs because CNV calling has not converged. It is now at 50 epochs. Is this expected, or is it possible that the optimisation procedure has become "trapped"?

    Ju Jose

    Thanks for the tutorial! Could you help me understand how the NA19017.chr20sub.bam file was prepared? Is it just BWA-mapped reads? Did it go through the sorting and duplicate-marking steps?

    I would like to know where I can get the following files:

    mappability-track regions file (in either .bed or .bed.gz format).
    segmental-duplication-track regions file (in either .bed or .bed.gz format).

    The link in the passage below, quoted from the tutorial above, is broken. Has the tutorial been updated to match the latest WDL pipeline?


    Download tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site. The bundle includes data for Notebook #11685 and Notebook #11686. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. The example data is from the 1000 Genomes project Phase 3 aligned to GRCh38.

    Thanks for the tutorial! I am having some trouble using it to call CNVs. Can you give me some suggestions? Here are my questions:

    Can't generate the ploidy-calls directory and ploidy-calls/SAMPLE_0 when using DetermineGermlineContigPloidy – GATK (

    Marcela Martinez


    I am planning to run the germline CNV workflow in Docker on 106 targeted exomes. I am not planning to use the WDL pipeline. I have already run the preprocessing step, and I wonder how to run the next step, CollectReadCounts, over all those exomes using the script below. Should I use a for loop to iterate over each BAM sample and get an HDF5 result for every sample?

    In addition, is it necessary to generate the cohort model on "normal samples", or can it be done on the same batch of affected ones?


    gatk CollectReadCounts \
              -I sample.bam \
              -L intervals.interval_list \
              --interval-merging-rule OVERLAPPING_ONLY \
              -O sample.counts.hdf5
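
A minimal sketch of the per-sample loop asked about above, assuming all BAMs sit in one directory and are named `<sample>.bam`. The function name `collect_counts`, the directory layout, and the `DRY_RUN` and `GATK_CMD` variables are illustrative assumptions, not part of the tutorial; the CollectReadCounts arguments are the ones shown in the command above.

```shell
#!/usr/bin/env bash
# Sketch: one CollectReadCounts run per BAM file in a directory.
# collect_counts, DRY_RUN and GATK_CMD are hypothetical names used for illustration.
collect_counts() {
    local bam_dir="$1" intervals="$2" out_dir="$3"
    local bam sample
    mkdir -p "$out_dir"
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue            # skip when the glob matches nothing
        sample="$(basename "$bam" .bam)"     # sample.bam -> sample
        # If DRY_RUN is set to any non-empty value, the command is printed, not run.
        ${DRY_RUN:+echo} "${GATK_CMD:-gatk}" CollectReadCounts \
            -I "$bam" \
            -L "$intervals" \
            --interval-merging-rule OVERLAPPING_ONLY \
            -O "$out_dir/$sample.counts.hdf5"
    done
}
```

Calling `DRY_RUN=1 collect_counts bams intervals.interval_list counts` prints the commands so they can be inspected first; dropping `DRY_RUN=1` runs them one sample at a time.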

    In step 4, the script contains all the input files. Running this script will take a lot of time. Is it possible to create a script for each input file? Will the results obtained be consistent with the results obtained from the script that contains all the input files?


    gatk GermlineCNVCaller \
            --run-mode COHORT \
            -L scatter-sm/twelve_1of2.interval_list \
            -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \
            -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \
            -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \
            -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \
            -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \
            -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \
            --contig-ploidy-calls ploidy-calls \
            --annotated-intervals twelveregions.annotated.tsv \
            --interval-merging-rule OVERLAPPING_ONLY \
            --output cohort24-twelve \
            --output-prefix cohort24-twelve_1of2 \
            --verbosity DEBUG
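
The command above can be generated rather than typed out. The sketch below keeps every sample in each cohort-mode run and splits the work by scattered interval list instead, building the repeated `-I` arguments from a directory. This is an illustration under stated assumptions: `call_cohort_shards`, `DRY_RUN` and `GATK_CMD` are hypothetical names, and the fixed `ploidy-calls` and `twelveregions.annotated.tsv` paths are simply copied from the command above. Whether splitting by sample gives equivalent results is not something this sketch addresses.

```shell
#!/usr/bin/env bash
# Sketch: run cohort-mode GermlineCNVCaller once per scattered interval list.
# call_cohort_shards, DRY_RUN and GATK_CMD are hypothetical names used for illustration.
call_cohort_shards() {
    local counts_dir="$1" scatter_dir="$2" out_dir="$3" prefix="$4"
    local inputs=() f shard name
    # Collect one "-I <file>" pair per read-count file.
    for f in "$counts_dir"/*.tsv; do
        [ -e "$f" ] && inputs+=(-I "$f")
    done
    for shard in "$scatter_dir"/*.interval_list; do
        [ -e "$shard" ] || continue
        name="$(basename "$shard" .interval_list)"   # e.g. twelve_1of2
        # If DRY_RUN is set to any non-empty value, the command is printed, not run.
        ${DRY_RUN:+echo} "${GATK_CMD:-gatk}" GermlineCNVCaller \
            --run-mode COHORT \
            -L "$shard" \
            "${inputs[@]}" \
            --contig-ploidy-calls ploidy-calls \
            --annotated-intervals twelveregions.annotated.tsv \
            --interval-merging-rule OVERLAPPING_ONLY \
            --output "$out_dir" \
            --output-prefix "${prefix}-${name}"
    done
}
```

For example, `DRY_RUN=1 call_cohort_shards cvg scatter-sm cohort24-twelve cohort24` prints one command per shard, each with the output prefix `cohort24-<shard name>`. The shard runs are independent of one another, so they could be submitted as separate jobs.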


    Chris Pyatt

    I'm trying to run this on a WES cohort with 200k intervals, split into 5k groups by the scatter method described in section 4.2

    When I compare between scattered & non-scattered results (on a smaller subset that is not intractable to run whole), the segments called are not the same. I presume this is because I am missing any CNVs that span a boundary between scatter groups. How can I get around this?

    Thank you
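
One thing that may be worth checking here is whether every shard is passed, in genomic order, to a single PostprocessGermlineCNVCalls run per sample, since that tool accepts repeated `--calls-shard-path` and `--model-shard-path` arguments. The sketch below only assembles such a command and is not a confirmed fix for the discrepancy described above. The function name, the output file names, and the `DRY_RUN` and `GATK_CMD` variables are illustrative assumptions, and the flag set should be checked against the installed GATK version.

```shell
#!/usr/bin/env bash
# Sketch: pass all GermlineCNVCaller shards to one PostprocessGermlineCNVCalls run.
# postprocess_sample, DRY_RUN, GATK_CMD and the output names are hypothetical.
postprocess_sample() {
    local run_dir="$1" sample_index="$2" ploidy_calls="$3" dict="$4"
    shift 4                                  # remaining arguments: shard prefixes, in order
    local shard_args=() prefix
    for prefix in "$@"; do
        # GermlineCNVCaller writes <prefix>-calls and <prefix>-model under its output dir.
        shard_args+=(
            --calls-shard-path "$run_dir/${prefix}-calls"
            --model-shard-path "$run_dir/${prefix}-model"
        )
    done
    # If DRY_RUN is set to any non-empty value, the command is printed, not run.
    ${DRY_RUN:+echo} "${GATK_CMD:-gatk}" PostprocessGermlineCNVCalls \
        "${shard_args[@]}" \
        --contig-ploidy-calls "$ploidy_calls" \
        --sample-index "$sample_index" \
        --sequence-dictionary "$dict" \
        --output-genotyped-intervals "sample_${sample_index}.intervals.vcf.gz" \
        --output-genotyped-segments "sample_${sample_index}.segments.vcf.gz" \
        --output-denoised-copy-ratios "sample_${sample_index}.denoised_copy_ratios.tsv"
}
```

Shard prefixes are taken as explicit arguments rather than from a glob because a plain glob sorts `10of12` before `2of12`, which would put the shards out of order.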

    Masoumeh Gmoghadam


    I encountered an error in the postprocess step:

    A USER ERROR has occurred: Couldn't read file /mnt/f/CNV-Cohort/fastq/CNV-Calling/ReadCounts/cohort1/cohort1_1of2-calls/interval_list.tsv. Error was: The input file does not exist.

    I'm confused, because there is an interval_list.tsv file in my cohort1_1of2-calls folder.

    I can't see the cause of this error. Can anybody help me?

    Gökalp Çelik

    Hi Masoumeh Gmoghadam

    If a GermlineCNVCaller step completes without issues, each shard should have an interval_list.tsv file within the shard's folder. If the file does not exist, either your GermlineCNVCaller step failed to complete or the paths in your PostprocessGermlineCNVCalls parameters are incorrect.

    I hope this helps.
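
The check described above can be sketched as a small shell function that tests each shard folder for interval_list.tsv before postprocessing. The function name `check_shards` and the output wording are illustrative assumptions; the folders to pass are whatever paths you give to PostprocessGermlineCNVCalls.

```shell
#!/usr/bin/env bash
# Sketch: confirm that every shard folder contains interval_list.tsv.
# check_shards is a hypothetical helper name used for illustration.
check_shards() {
    local missing=0 dir
    for dir in "$@"; do
        if [ -f "$dir/interval_list.tsv" ]; then
            echo "ok      $dir"
        else
            echo "MISSING $dir/interval_list.tsv"
            missing=1
        fi
    done
    return "$missing"    # non-zero exit status if any shard lacks the file
}
```

Running it with exactly the same path strings used in the failing command also helps reveal path problems, such as a relative path that resolves differently inside a container.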


    Masoumeh Gmoghadam

    Thank you so much Gökalp Çelik


    Masoumeh Gmoghadam

    Unfortunately, my PostprocessGermlineCNVCalls step came up with another error:

    "Records were not strictly sorted in dictionary order"

    I'm really wondering why the order of the chromosomes in my ucsc.hg19.dict file is 1,2,3,4,5,6,7,X,8,9,10,11,12,13,14,15,16,17,18,20,Y,19,22,21 (followed by decoys and M). I modified the order to M, 1 to 22, X, Y and deleted all of the decoy chromosomes in the dict. Then I ran the pipeline again, but the annotated intervals file is still in the same order as before. I checked everything. Is something wrong with my hg19 reference?

    Masoumeh Gmoghadam

    I used h5dump on one of my HDF5 files and the contig order was M, 1 to 22, X, Y. Has it been sorted, or is that the actual order?

    Gökalp Çelik

    Hi Masoumeh Gmoghadam

    You need to make sure that your interval list, annotation file, reference dictionary, and BAM files were all generated from the same reference file. We recommend using hs37d5 or the human reference hg38 that we have in our resource bundles.

    If you are using something different, make sure that all files are compatible to begin with. We do not recommend manually editing the dictionary file.

    Finally, we also do not recommend running the gCNV workflow outside of the primary contigs for human samples.
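
A quick way to test the compatibility described above is to compare the contig order declared in the `@SQ` header lines of two files, for example the reference `.dict` and a Picard-style `.interval_list`. This is a minimal sketch; `contig_order` and `same_contig_order` are hypothetical helper names, and the check covers contig names and their order only, not contig lengths.

```shell
#!/usr/bin/env bash
# Sketch: compare contig order between two files that carry SAM-style @SQ headers.
# contig_order and same_contig_order are hypothetical names used for illustration.
contig_order() {
    # Print the SN: value of every @SQ line, in file order.
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
    }' "$1"
}

same_contig_order() {
    # Succeeds only if both files list the same contigs in the same order.
    [ "$(contig_order "$1")" = "$(contig_order "$2")" ]
}
```

For example, `same_contig_order ucsc.hg19.dict intervals.interval_list || echo "contig order differs"`. For a BAM file, the header would first need to be exported to text, for instance with `samtools view -H`, assuming samtools is installed.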


    Masoumeh Gmoghadam

    Thanks for helping me, you are right and I think I found the solution.

    Best Regards.

