Call copy number variants with GermlineCNVCaller
REQUIRED for all errors and issues:
a) GATK version used: 4.3
I`m following this pipeline here: https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-rare-germline-copy-number-variants#top
I have 18 samples, and I was able to generate the ploidy calls and model.
I was able to divide my cohort.gc.filtered.interval_list into scatters. Using the picard IntervalListTools, however, I obtained 34 temp files...(e.g., temp_001_of_34), and I am unsure how to call the GermlineCNVCaller with these files, because in the example, the 24 samples were divided into two scatter intervals.
-
When you create your scattered intervals for each scatter you need to run a separate instance of GermlineCNVCaller which will create shards of GCNV models and calls. These models and calls will then be fed to PostProcessGermlineCNVCalls to combine and generate a single call file for each sample.
When you generate your scattered intervallists make sure that
- None of your original intervals are split
- Scatters contain enough intervals to generate a converging model.
If your original intervals that you collect reads from are split then you will not be able to run the GermlineCNVCaller and will receive an error message indicating not all intervals are included in the readcount files.
I hope this helps.
-
Hi Gokalp
I created my scatter using this command:
picard IntervalListTools --INPUT cohort.gc.interval_list --SUBDIVISION_MODE interval_count --SCATTER_CONTENT 5000 --OUTPUT scatter
Is the same command that the gatk recommended.
And the other setps I`m following this:https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-rare-germline-copy-number-variants#top -
Looks fine. If you observe any issues with convergence which might happen given the low number of samples and low number of intervals in each scatter you may reduce the number of scatters eventually.
-
Okay! Thank you!
Just one more question, to call copy number variants, this is the example code:gatk GermlineCNVCaller \ --run-mode COHORT \ -L scatter-sm/twelve_1of2.interval_list \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv... \ --contig-ploidy-calls ploidy-calls \ --annotated-intervals twelveregions.annotated.tsv \ --interval-merging-rule OVERLAPPING_ONLY \ --output cohort24-twelve \ --output-prefix cohort24-twelve_1of2 \ --verbosity DEBUG
Given the large number of scatter outputs, what should I use in the -L and --annotated-intervals?
I know my question sounds silly, but I got really confused in this section. -
You can use the same annotated intervals file for each scatter. Tool will pick suitable annotations directly. For the -L each instance will get one of the scatters you produced with IntervalListTools.
-
hello Gökalp Çelik and Isadora Machado Ghilardi,
I have read the conversations you've had so far on the files produced by PicardIntervalList and the use of the files to call gCNVs by GermlineCNVCaller.
I have also generated 20 files with the PicardIntervalList, however, I am confused as to what to do next. Should I use the GermlineCNVCaller on each of the temp interval list, thus resulting in 20 runs of the GermlineCNVcaller?
I'm grateful for your inputs. -
If you are planning on a scattered GCNV workflow, you need to perform GermlineCNVCaller step on each interval_list that you produce resulting in that number of GCNV models and calls creation. In the later step of PostProcessGermlineCNVCalls you will have a chance to combine them into a single callset by specifying each GCNV model and calls directories as input.
I hope this helps.
-
Thank you for the clarity Gökalp Çelik. I run the below command for my samples but I ended up with an error
gatk GermlineCNVCaller \
--run-mode COHORT \
-L cnv/gatk_cnv/scatter/temp_0001_of20/scattered.interval_list \
-I cnv/gatk_cnv/read_counts/GH20130774-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130800-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130842-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130817-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130873-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130857-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130815-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130796-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130847-1.counts.hdf5 \
-I cnv/gatk_cnv/read_counts/GH20130801-1.counts.hdf5 \
--contig-ploidy-calls cnv/gatk_cnv/ploidy/ploidy-calls \
--annotated-intervals annotated_intervals.tsv \
--interval-merging-rule OVERLAPPING_ONLY \
--output cohort20 \
--output-prefix cohort20_1of20 \
--verbosity DEBUGThe error I encountered is:
A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "cnv/gatk_cnv/scatter/temp_0001_of20/scattered.interval_list" is not valid for this input.
-
I wanted to also know if I could just run the GermlineCNV without creating the scatters?
-
You can run GermlineCNVCaller with or without scattered intervals it is not a problem.
For the above issue how did you create those scattered intervals? Can you provide us the command you used to do that?
Regards.
-
Hello Gökalp Çelik,
below is the command I used in creating the scatter
gatk IntervalListTools \
--INPUT filtered_intervals.interval_list \
--SUBDIVISION_MODE INTERVAL_COUNT \
--SCATTER_CONTENT 10000 \
--OUTPUT scatter \ -
Can you check if the path to those scattered files is correct? That error message indicates that somehow.
-
Thank you Gökalp Çelik. I have been able to figure out what the problem was. I made a mistake in the name of the directory. I typed
temp_0001_of20/scattered.interval_list instead of
temp_0001_of_20/scattered.interval_list
However, after running, I encountered another error:
cohort_denoising_calling.5209905922033502817.py: error: unrecognized arguments: --random_seed=1984 --num_samples_copy_ratio_approx=200
14:09:06.035 INFO GermlineCNVCaller - Shutting down engine -
Can you tell us your steps until this part? Are you running this command within suggested GATK conda environment?
-
I created a conda environment and downloaded GATK version 4.3.0.0 and python3.6.10. I then downloaded gcnvkernel and all it's dependencies. I have been running the commands in this environment.
Initially, I had another environment with GATK v4.5.0.0, however, that packages in that environment wouldn't allow for the installation of gcnvkernel from conda so I created the environment for GATK v4.3 -
Our recommendation is to use the yml file provided within the gatk bundle to create the conda environment. Hand crafted conda environments might end up different behaviors that we cannot debug. Can you remove all environments that you created for gatk and install the environment using the yml file like below.
conda env create -f gatkcondaenv.yml
-
please can you provide me the link to the yml file?
-
It can be found inside the release zip file downloaded from the main page.
https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip
-
Gökalp Çelik, if I create the new environment with the gatkcondaenv.yml, does it mean I'd have to rerun the commands from the beginning (i.e PreprocessIntervals) or I can continue from where the error occurred?
-
Commands that require gcnvkernel and the python environment are
DetermineGermlineContigPloidy
GermlineCNVCaller
PostProcessGermlineCNVCallsOthers can run outside of the conda environment and should work just fine.
-
Gökalp Çelik, thanks for the yaml file. I have created the environment with the yaml file and run a pip on the pyhtonpackage.zp. How do I install gatk now since it is not part of the packages in the yaml or the zip (so sorry if this sounds stupid, I am new to this :))?
-
Just place the gatk python script and 2 jar files to a place inside the PATH and it should be fine.
-
Thank you so much Gökalp Çelik, I have been able to do that and successfully call the copy numbers.
Grateful for all the helpA snapshot of part of the segmental call:
-
Hello Gökalp Çelik,
Are there standard quality thresholds for QA and QS for the segmented CNV calls? I want to filter the calls based on the qualities but I am yet to find what the threshold qualities are for QA and QS.
I appreciate all the help thus far -
Hi again.
You can check the paper from the link below.
https://www.biorxiv.org/content/10.1101/2022.08.25.504851v1.full.pdf
Please sign in to leave a comment.
25 comments