Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK CNVs Ploidy Interpretation


    Hi Marcela Martinez

    Each per-sample output folder from DetermineContigPloidy contains a file named contig_ploidy.tsv. That file holds the ploidy estimate for every contig of that sample.

    To be sure that those samples really carry these abnormal ploidies, you should validate the findings with an independent method.

    One such method is to check the allelic balance on the abnormal contigs. If a contig really has a copy-number alteration, you should see skewed Ref/Alt read counts at its high-quality biallelic sites. Another way is to count the reads that map to each contig with samtools idxstats and compute the mean or median count per contig across your samples. From there you can calculate a z-score, or another suitable statistic, to confirm the result.
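
    As a rough illustration of the idxstats check, the sketch below turns the mapped-read counts into per-contig fractions and scores one sample against a set of reference samples. The function names are made up for this example, and using the fraction of mapped reads as the normalisation is an assumption, not something GATK prescribes.

```python
import statistics


def mapped_fractions(idxstats_text):
    """Parse `samtools idxstats` text into {contig: fraction of mapped reads}."""
    counts = {}
    for line in idxstats_text.splitlines():
        fields = line.split("\t")
        # idxstats columns: contig name, contig length, mapped reads, unmapped reads.
        if len(fields) < 4 or fields[0] == "*":
            continue  # skip malformed lines and the unplaced-reads row
        counts[fields[0]] = int(fields[2])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {contig: n / total for contig, n in counts.items()}


def contig_z_scores(sample_fractions, reference_fractions):
    """Z-score of each contig in one sample against a list of reference samples."""
    scores = {}
    for contig, value in sample_fractions.items():
        ref = [r[contig] for r in reference_fractions if contig in r]
        if len(ref) < 2:
            continue  # stdev needs at least two reference values
        spread = statistics.stdev(ref)
        if spread == 0:
            continue  # no variation in the reference set, z-score undefined
        scores[contig] = (value - statistics.mean(ref)) / spread
    return scores
```

    Feed it the text of `samtools idxstats sample.bam` for each sample. Sex chromosomes are best compared only among samples of the same sex, and a large z-score is a hint to look closer, not proof of an abnormal ploidy.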

    However, numbers tend to mislead when other variables are in play.

    Exome capture does not perform equally for all samples, and not all isolated DNA is of equal quality. If you can, run CollectHsMetrics on your samples and check the mean and median depths as well as the AT and GC bias metrics, which are found at the far right of the table. Once you have HsMetrics for all your samples, you can compare them and look for outliers.

    The AT and GC bias metrics tell you how evenly AT-rich and GC-rich regions were captured and mapped during exome sequencing. Ideally you want those numbers to be close to 1 or 2 (depending on the capture kit and target regions). In samples that go well beyond that, targets are captured less efficiently, so GC-rich or AT-rich regions and contigs receive fewer or more reads depending on the direction of the bias. The first chromosomes to show increased or decreased contig ploidy are 19, 16, 17, 20 and 22. These contigs are GC-rich, chromosome 19 especially. When the capture tends to pull in more GC and less AT, the GC-rich chromosomes get more reads.
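
    To compare HsMetrics across samples, something like the sketch below could help. It reads the single metrics row from a Picard-style metrics file and flags samples whose value for one metric sits far from the cohort median. The modified z-score rule (median and MAD, cutoff 3.5) is my own choice for illustration, not a GATK recommendation, and the column you pass in (for example GC_DROPOUT or AT_DROPOUT) is an assumption about which columns your HsMetrics file contains.

```python
import statistics


def parse_picard_metrics(text):
    """Return the first metrics row of a Picard metrics file as {column: value}."""
    # Comment lines start with '#'; the first remaining line is the header
    # and the second is the metrics row. Any histogram rows come later.
    rows = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    header = rows[0].split("\t")
    values = rows[1].split("\t")
    return dict(zip(header, values))


def flag_outliers(metric_by_sample, threshold=3.5):
    """Names of samples whose metric is far from the cohort median (MAD rule)."""
    values = list(metric_by_sample.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return sorted(
        name
        for name, v in metric_by_sample.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )
```

    For example, build `{sample: float(parse_picard_metrics(text)["GC_DROPOUT"])}` over all samples and pass it to `flag_outliers`. With only a handful of samples the rule has little power, so eyeballing the table is still worthwhile.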

    This bias also shows up in the difference between the mean and median depth of your targets. If the gap is large, for example 100X mean versus 60X median, your data is severely biased and you will not get good-quality CNV calls.
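
    A small worked example of the mean versus median comparison, using made-up per-target depths chosen so that a single over-captured target reproduces the 100X mean and 60X median case. The function name is hypothetical and no cutoff is implied; how large a ratio is too large is a judgment call for your data.

```python
import statistics


def depth_skew(per_target_depths):
    """Return (mean, median, mean/median) for a list of per-target depths."""
    mean_depth = statistics.fmean(per_target_depths)
    median_depth = statistics.median(per_target_depths)
    if median_depth == 0:
        raise ValueError("median depth is zero, ratio is undefined")
    return mean_depth, median_depth, mean_depth / median_depth


# Toy depths: six ordinary targets and one heavily over-captured target.
toy_depths = [50, 55, 60, 60, 65, 70, 340]
mean_depth, median_depth, ratio = depth_skew(toy_depths)
print(f"mean={mean_depth:.0f}X median={median_depth:.0f}X ratio={ratio:.2f}")
```

    A ratio near 1 means the depth distribution is roughly symmetric. The further it climbs above 1, the more a minority of targets is soaking up reads.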

    The same bias can also raise a false alarm at the DetermineContigPloidy step even when no abnormal ploidy is present. To work around this you can omit those samples, regroup samples with similar AT or GC bias and analyze them together, or hand-adjust your contig_ploidy.tsv files and continue the analysis. Even then, you will still struggle to eliminate the many false DEL and DUP calls that the bias causes. My best suggestion is to omit those samples and resequence them from a better DNA sample to get less biased results.
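
    If you go the regrouping route, one simple way to form groups is to sort samples by a bias metric and start a new group wherever the gap to the next sample is large. The sketch below does only that. The function name and the `max_gap` parameter are hypothetical, and the right gap depends on the metric and on your data.

```python
def group_by_bias(metric_by_sample, max_gap):
    """Split samples into groups of similar bias, ordered by metric value."""
    ordered = sorted(metric_by_sample.items(), key=lambda item: item[1])
    groups = []
    previous = None
    for name, value in ordered:
        # Open a new group for the first sample or after a large jump.
        if previous is None or value - previous > max_gap:
            groups.append([])
        groups[-1].append(name)
        previous = value
    return groups
```

    Groups that end up with very few samples may be too small to analyze as a cohort on their own, so check the group sizes before rerunning anything.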

    Good luck. 
