Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GATK Germline CNV calling

0

8 comments

  • Avatar
    Gökalp Çelik

    Hi Ramesh

    GCNV has 2 separate modes both of which may serve for different purposes. COHORT mode runs without a prior model and generates the model on the fly as it detects and calculates CNVs. If you have all your samples run in COHORT mode you don't have to re-run them in CASE mode again. COHORT mode usually requires high number of samples to run preferably 30 or more. 

    CASE mode is useful when you have low number of samples and a comparable model that you generated previously from a normal cohort. 

    Looking at the way you generated your flow (though we cannot fix any scriptual errors here) I spotted a few tools that are not part of the actual GCNV workflow. Those tools belong to Somatic CNV analysis workflow so you can remove them. 

    You may refer to our chart below

    About your question on interval lists you can prepare one yourself that suits your capture kit and/or reference genome using the tool PreProcessIntervals. 

    Finally we have a publication on GCNV workflow on UKBB exomes which you may refer to when filtering exome CNV calls.

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10904014/ 

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Ramesh

    Thank you so much for replying and the flow chart Gökalp Çelik. I will remove the unnecessary tools/steps. I was testing the pipeline for both cases and hence was running it on CASE and COHORT. Ideally we would be running in COHORT mode. 

    We are trying to identify CNVs in our WGS data and sometimes we might run it on WES. Are there any specific differences in the process between the two other than interval definition and the command in preprocess interval step?

    Thank you for your time and guidance.

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    Those 2 different sample types might require different parameters for tuning. Our defaults in GATK 4.5 and above are mostly tuned for whole exome data in the study published in the link below

    https://www.nature.com/articles/s41588-023-01449-0 

    Whole genome sequencing samples could be tuned with parameters indicated in the document below.

    https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-rare-germline-copy-number-variants 

    I hope this helps.

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Ramesh

    Thanks again Gökalp Çelik.

    Can you also kindly clarify about the interval list and ploidy priors.

    So, I'm currently using wgs_calling_regions.hg38.interval_list  from GATK's resource bundle.

    I have excluded all the non-standard and random chromosomes and kept just chr1:22, X and Y

    My ploidy priors file is: 

    CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3

    chr1 0 0.01 0.98 0.01

    chr2 0 0.01 0.98 0.01

    chr3 0 0.01 0.98 0.01

    chr4 0 0.01 0.98 0.01

    chr5 0 0.01 0.98 0.01

    chr6 0 0.01 0.98 0.01

    chr7 0 0.01 0.98 0.01

    chr8 0 0.01 0.98 0.01

    chr9 0 0.01 0.98 0.01

    chr10 0 0.01 0.98 0.01

    chr11 0 0.01 0.98 0.01

    chr12 0 0.01 0.98 0.01

    chr13 0 0.01 0.98 0.01

    chr14 0 0.01 0.98 0.01

    chr15 0 0.01 0.98 0.01

    chr16 0 0.01 0.98 0.01

    chr17 0 0.01 0.98 0.01

    chr18 0 0.01 0.98 0.01

    chr19 0 0.01 0.98 0.01

    chr20 0 0.01 0.98 0.01

    chr21 0 0.01 0.98 0.01

    chr22 0 0.01 0.98 0.01

    chrX 0.01 0.49 0.49 0.01

    chrY 0.495 0.495 0.01 0

    When I run the GermLineCNVCaller in cohort mode, I get the warning:

    15:21:38.476 WARNING gcnvkernel.structs.metadata - Sample 2908946 has an anomalous ploidy (3) for contig chr1. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.

    15:21:38.493 WARNING gcnvkernel.structs.metadata - Sample 2908956 has an anomalous ploidy (3) for contig chr1. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.

    15:21:38.494 WARNING gcnvkernel.structs.metadata - Sample 2908956 has an anomalous karyotype ({'X': 3, 'Y': 0}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.

    15:21:38.512 WARNING gcnvkernel.structs.metadata - Sample 2908966 has an anomalous ploidy (3) for contig chr1. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.

    Now my bam and ref files include the non-standard chromosomes. Is this causing the issue?

    What can I do it avoid this? Should I remove the reads from non standard chromosomes from my bam files? what is the recommended way to handle this? I tried to get mappability tracks from UCSC but am unsure on how and which step of the CNV analysis to use.

    This is my current command:

    $gatk GermlineCNVCaller \

        -L ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/wgs_calling_regions_hg38_filtered.interval_list \

        --interval-merging-rule OVERLAPPING_ONLY \

        --run-mode COHORT \

        --contig-ploidy-calls ${data}/ploidy-calls \

        $COHORT_INPUT \

        --annotated-intervals ${MAIN_DIR}/SUPP_FILES/REFERENCES/${reference}/wgs_calling_regions_hg38_annotated_interval_list.tsv \

        --output ${data}/CNVcalls \

        --output-prefix cohort \

        --verbosity DEBUG

    Thank you again for patiently guiding me through this.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Ramesh

    Ploidy priors are empirical values that could represent your data as baseline probability for aneuploidies. If you wish to keep your samples more in the diploid state and only accept non-marginal differences as aneuploidies you may need to increase the prior probability of the diploid state and reduce all others. Regardless GermlineCNVCaller step will reassess and report any copy number estimations regardless of your ploidy state. 

    For the intervals we recommend using intervals that reside in the more of the mappable parts of the genome which can be filtered out using additional mappability and segmental duplications annotations. 

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Ramesh

    Thank you for that clarification Gökalp Çelik. Are there any specific mappability tracks or annotations recommended as best practice?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again. 

    You may want to refer to Hoffman labs pages for mappability and UCSC segmental duplications page for segmental duplication annotations.

    https://bismap.hoffmanlab.org/

    https://genome.ucsc.edu/cgi-bin/hgTrackUi?g=genomicSuperDups 

    I hope this helps.

    Regards. 

    1
    Comment actions Permalink
  • Avatar
    Ramesh

    Thank you so much! 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk