Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Call copy number variants with GermlineCNVCaller

0

25 comments

  • Avatar
    Gökalp Çelik

    Hi Isadora Machado Ghilardi

    When you create your scattered intervals for each scatter you need to run a separate instance of GermlineCNVCaller which will create shards of GCNV models and calls. These models and calls will then be fed to PostProcessGermlineCNVCalls to combine and generate a single call file for each sample. 

    When you generate your scattered intervallists make sure that

    1. None of your original intervals are split
    2. Scatters contain enough intervals to generate a converging model. 

    If your original intervals that you collect reads from are split then you will not be able to run the GermlineCNVCaller and will receive an error message indicating not all intervals are included in the readcount files. 

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Isadora Machado Ghilardi

    Hi Gokalp

    I created my scatter using this command: 

    picard IntervalListTools --INPUT cohort.gc.interval_list --SUBDIVISION_MODE interval_count --SCATTER_CONTENT 5000 --OUTPUT scatter

    Is the same command that the gatk recommended. 

    And the other setps I`m following this:https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-rare-germline-copy-number-variants#top

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Looks fine. If you observe any issues with convergence which might happen given the low number of samples and low number of intervals in each scatter you may reduce the number of scatters eventually. 

     

    0
    Comment actions Permalink
  • Avatar
    Isadora Machado Ghilardi

    Okay! Thank you!

    Just one more question, to call copy number variants, this is the example code:

    gatk GermlineCNVCaller \ --run-mode COHORT \ -L scatter-sm/twelve_1of2.interval_list \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv... \ --contig-ploidy-calls ploidy-calls \ --annotated-intervals twelveregions.annotated.tsv \ --interval-merging-rule OVERLAPPING_ONLY \ --output cohort24-twelve \ --output-prefix cohort24-twelve_1of2 \ --verbosity DEBUG

    Given the large number of scatter outputs, what should I use in the -L and --annotated-intervals?
    I know my question sounds silly, but I got really confused in this section.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    You can use the same annotated intervals file for each scatter. Tool will pick suitable annotations directly. For the -L each instance will get one of the scatters you produced with IntervalListTools. 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    hello Gökalp Çelik and Isadora Machado Ghilardi,

    I have read the conversations you've had so far on the files produced by PicardIntervalList and the use of the files to call gCNVs by GermlineCNVCaller.
    I have also generated 20 files with the PicardIntervalList, however, I am confused as to what to do next. Should I use the GermlineCNVCaller on each of the temp interval list, thus resulting in 20 runs of the GermlineCNVcaller?

    I'm grateful for your inputs.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Samuel Quaynor

    If you are planning on a scattered GCNV workflow, you need to perform GermlineCNVCaller step on each interval_list that you produce resulting in that number of GCNV models and calls creation. In the later step of PostProcessGermlineCNVCalls you will have a chance to combine them into a single callset by specifying each GCNV model and calls directories as input. 

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Thank you for the clarity Gökalp Çelik. I run the below command for my samples but I ended up with an error

     

     

    gatk GermlineCNVCaller \

    --run-mode COHORT \

    -L cnv/gatk_cnv/scatter/temp_0001_of20/scattered.interval_list \

    -I cnv/gatk_cnv/read_counts/GH20130774-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130800-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130842-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130817-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130873-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130857-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130815-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130796-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130847-1.counts.hdf5 \

    -I cnv/gatk_cnv/read_counts/GH20130801-1.counts.hdf5 \

    --contig-ploidy-calls cnv/gatk_cnv/ploidy/ploidy-calls \

    --annotated-intervals annotated_intervals.tsv \

    --interval-merging-rule OVERLAPPING_ONLY \

    --output cohort20 \

    --output-prefix cohort20_1of20 \

    --verbosity DEBUG





    The error I encountered is: 

    A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "cnv/gatk_cnv/scatter/temp_0001_of20/scattered.interval_list" is not valid for this input.

     

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    I wanted to also know if I could just run the GermlineCNV without creating the scatters?

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    You can run  GermlineCNVCaller with or without scattered intervals it is not a problem. 

    For the above issue how did you create those scattered intervals? Can you provide us the command you used to do that?

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Hello Gökalp Çelik,

    below is the command I used in creating the scatter

    gatk IntervalListTools \

    --INPUT filtered_intervals.interval_list \

    --SUBDIVISION_MODE INTERVAL_COUNT \

    --SCATTER_CONTENT 10000 \

    --OUTPUT scatter \
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Can you check if the path to those scattered files is correct? That error message indicates that somehow. 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Thank you Gökalp Çelik. I have been able to figure out what the problem was. I made a mistake in the name of the directory. I typed 

    temp_0001_of20/scattered.interval_list instead of
    temp_0001_of_20/scattered.interval_list

    However, after running, I encountered another error:
    cohort_denoising_calling.5209905922033502817.py: error: unrecognized arguments: --random_seed=1984 --num_samples_copy_ratio_approx=200
    14:09:06.035 INFO  GermlineCNVCaller - Shutting down engine

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Can you tell us your steps until this part? Are you running this command within suggested GATK conda environment?

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    I created a conda environment and downloaded GATK version 4.3.0.0 and python3.6.10. I then downloaded gcnvkernel and all it's dependencies. I have been running the commands in this environment.

    Initially, I had another environment with GATK v4.5.0.0, however, that packages in that environment wouldn't allow for the installation of gcnvkernel from conda so I created the environment for GATK v4.3

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Our recommendation is to use the yml file provided within the gatk bundle to create the conda environment. Hand crafted conda environments might end up different behaviors that we cannot debug. Can you remove all environments that you created for gatk and install the environment using the yml file like below.

    conda env create -f gatkcondaenv.yml
    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    please can you provide me the link to the yml file?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    It can be found inside the release zip file downloaded from the main page. 

    https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Gökalp Çelik, if I create the new environment with the gatkcondaenv.yml, does it mean I'd have to rerun the commands from the beginning (i.e PreprocessIntervals) or I can continue from where the error occurred?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Commands that require gcnvkernel and the python environment are 

    DetermineGermlineContigPloidy

    GermlineCNVCaller

    PostProcessGermlineCNVCalls

    Others can run outside of the conda environment and should work just fine. 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Gökalp Çelik, thanks for the yaml file. I have created the environment with the yaml file and run a pip on the pyhtonpackage.zp. How do I install gatk now since it is not part of the packages in the yaml or the zip (so sorry if this sounds stupid, I am new to this :))?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Just place the gatk python script and 2 jar files to a place inside the PATH and it should be fine. 

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Thank you so much Gökalp Çelik, I have been able to do that and successfully call the copy numbers.
    Grateful for all the help

    A snapshot of part of the segmental call:

    0
    Comment actions Permalink
  • Avatar
    Samuel Quaynor

    Hello Gökalp Çelik,

    Are there standard quality thresholds for QA and QS for the segmented CNV calls? I want to filter the calls based on the qualities but I am yet to find what the threshold qualities are for QA and QS.

    I appreciate all the help thus far

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    You can check the paper from the link below.

    https://www.biorxiv.org/content/10.1101/2022.08.25.504851v1.full.pdf 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk