gCNV - Discrepancy in Results between Hg19 and Hg38 Cohort Models
Hi,
I'm facing a puzzling issue with my cohort model, and I was hoping to get some insights from the community.
I've successfully built a cohort model for Hg38 using 30 samples that were sequenced with the same kit. However, the resulting calls do not agree with the metadata as expected. Interestingly, when I ran the same analysis for Hg19, following the identical procedure outlined in the GATK tutorial "(How to) Call common and rare germline copy number variants", the results matched the metadata.
For context, the data are WES, 60 samples in total: 30 for Hg19 (sequenced in a single run) and 30 for Hg38 (not sequenced in a single run, but with the same capture kit). The samples were collected from different locations, and for the case samples I have metadata generated by a different caller, specifically DRAGEN, to compare the results against.
While GATK gCNV for Hg19 yielded results that match the metadata, the same cannot be said for Hg38. I'm aware that there might be algorithmic differences between the two CNV callers, but I'm struggling to pinpoint the exact issue.
GATK version: 4.4.0.0
BED file: Twist_Comprehensive_Exome_Covered_Targets_hg38/hg19.bed
Has anyone encountered a similar situation or can offer some guidance on how to troubleshoot this? I appreciate any assistance or insights you can provide!
Thank you,
Joshua
-
Hi Joshua Ravi
Is the metadata you mention a set of known CNVs for those samples, or was it also generated by another caller?
One thing that is certainly important is using exactly the same reference genome when comparing against known results. DRAGEN uses a custom masked hg38 reference for its mapping and secondary analysis. If you are using the default hg38 reference with alt contigs and HLA, without additional masking or alt-aware mapping, you may get different CNV calls.
Can you make sure that the hg38 genome you are using is the same reference genome that DRAGEN uses?
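One way to check this is to compare the sequence dictionaries of the two references (the `@SQ` lines in the `.dict` file, or in the BAM header as printed by `samtools view -H`). Below is a minimal sketch, assuming both headers carry `SN`, `LN` and `M5` tags. The function names and the toy headers are illustrative only, and the checksums are placeholders rather than real values. A masked reference would typically keep the same contig lengths but have different `M5` checksums.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: (length, md5)} from the @SQ lines of a SAM header or .dict file."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Every field after the record type is a tab-separated TAG:VALUE pair.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        # M5 is optional in the SAM spec, so it may be None here.
        contigs[tags["SN"]] = (int(tags["LN"]), tags.get("M5"))
    return contigs


def compare_dictionaries(ours, theirs):
    """Report contigs missing on one side, or differing in length or checksum."""
    a, b = parse_sq_lines(ours), parse_sq_lines(theirs)
    return {
        "only_in_ours": sorted(set(a) - set(b)),
        "only_in_theirs": sorted(set(b) - set(a)),
        "different": sorted(name for name in set(a) & set(b) if a[name] != b[name]),
    }


# Toy headers with placeholder checksums (not real M5 values).
ours = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:248956422\tM5:aaa\n"
    "@SQ\tSN:chr1_KI270706v1_random\tLN:175055\tM5:bbb\n"
)
theirs = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:248956422\tM5:ccc\n"
)

print(compare_dictionaries(ours, theirs))
```

Any contig listed under `different` or present on only one side is a sign that the two analyses were not run against the same reference.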
Another issue that may affect the calls for your samples is the quality of the captures generated by different labs. Each capture is unique, so mixing samples from multiple labs or runs can produce unexpected results. This has been my personal experience as well. The best way to overcome it is to collect as many samples as you can, and also to check the AT and GC dropout rates of the samples using the CollectHsMetrics tool. When AT and GC dropout rates differ between samples, those samples will start showing unexpected CNV behavior. For this reason I highly recommend that you check these parameters as well.
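As a rough illustration of that check, here is a minimal sketch that flags samples whose `AT_DROPOUT` or `GC_DROPOUT` value sits far from the cohort median. The function name, the toy values and the cutoff of 2.0 are assumptions for illustration, not a GATK recommendation. Pick a cutoff that suits your own cohort.

```python
from statistics import median


def flag_dropout_outliers(dropouts, max_deviation=2.0):
    """Flag samples whose AT or GC dropout is far from the cohort median.

    dropouts: {sample_name: (at_dropout, gc_dropout)}, values as reported
    by CollectHsMetrics.
    max_deviation: allowed absolute distance from the median (assumed cutoff).
    """
    at_median = median(at for at, _ in dropouts.values())
    gc_median = median(gc for _, gc in dropouts.values())
    flagged = []
    for sample, (at, gc) in dropouts.items():
        if abs(at - at_median) > max_deviation or abs(gc - gc_median) > max_deviation:
            flagged.append(sample)
    return sorted(flagged)


# Toy values for four hypothetical samples; S4 has a much higher AT dropout.
cohort = {
    "S1": (1.2, 0.4),
    "S2": (1.5, 0.3),
    "S3": (1.1, 0.5),
    "S4": (6.8, 0.2),
}

print(flag_dropout_outliers(cohort))  # ['S4']
```

Samples flagged this way are candidates for exclusion from the cohort model, or for a separate model built from samples with similar capture behavior.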
I hope these help.
Regards.
-
Thanks Gökalp Çelik
The metadata was generated by another CNV caller, and we utilized the same reference genome. For the Hg19 cohort model, all samples originated from a single lab and a single run. In the case of the Hg38 cohort model, the samples were sourced from a single lab but not from a single run. It's important to note that the samples were not mixed, and distinct models were constructed for each genome build.
I am seeking guidance on the interpretation of the CollectHsMetrics output.txt file. Any insights or tips on how to effectively decipher this file would be greatly appreciated.
Thanks in advance for your assistance!
Best regards,
Joshua
-
Hi Joshua Ravi
The CollectHsMetrics output is a tab-separated file, so you can open it with any spreadsheet editor and see the columns and the value for each column clearly. An explanation of each column can be found at the link below.
https://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics
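If you would rather read the file programmatically, the sketch below pulls the metrics row into a dictionary. It assumes the usual Picard layout, in which the line starting with `## METRICS CLASS` is followed by a header row and then a single row of values. The function name and the shortened example file are illustrative, and a real file has many more columns.

```python
def parse_hs_metrics(text):
    """Return {column_name: value} for the first metrics row of a Picard metrics file."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            # Values are kept as strings; convert the ones you need.
            return dict(zip(header, values))
    raise ValueError("no '## METRICS CLASS' section found")


# Shortened, made-up example with only a few of the real columns.
example = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# CollectHsMetrics\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tMEAN_TARGET_COVERAGE\tAT_DROPOUT\tGC_DROPOUT\n"
    "Twist\t95.3\t1.2\t0.4\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
)

metrics = parse_hs_metrics(example)
print(float(metrics["AT_DROPOUT"]), float(metrics["GC_DROPOUT"]))  # 1.2 0.4
```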
Not all CNV callers are expected to produce identical results. However, if the results diverge strongly only for the hg38 dataset, some parameters of your analysis probably do not match those used by the center that generated the metadata. We may need more details on how the metadata was generated, and on whether you ran the mapping and alignment yourself or used the alignments produced by that center. We may also ask you to provide some examples of how the metadata differs from your own gCNV calls.
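To gather such examples, one option is to list the calls from one caller that have no matching call of the same type from the other. The sketch below uses reciprocal overlap as the matching criterion. That criterion, the 0.5 threshold, the tuple layout and the function names are all assumptions for illustration, not something either caller prescribes.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between calls a and b.

    Each call is (contig, start, end, call_type) with half-open coordinates.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def unmatched_calls(ours, theirs, min_overlap=0.5):
    """Return calls in `ours` with no same-type call in `theirs` reaching min_overlap."""
    return [
        call
        for call in ours
        if not any(
            call[3] == other[3] and reciprocal_overlap(call, other) >= min_overlap
            for other in theirs
        )
    ]


# Toy call sets for one hypothetical sample.
gcnv_calls = [("chr1", 1000, 2000, "DEL"), ("chr2", 5000, 6000, "DUP")]
other_calls = [("chr1", 1100, 2100, "DEL"), ("chr2", 5000, 9000, "DUP")]

print(unmatched_calls(gcnv_calls, other_calls))  # [('chr2', 5000, 6000, 'DUP')]
```

Running the comparison in both directions shows which calls are specific to each caller, and a handful of those per sample is usually enough to start troubleshooting.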
Regards.