Question about using CNNScoreVariants etc. with targeted sequencing data
I'm interested in using CNNScoreVariants and the related CNN tools with some targeted sequencing data (human data on 63 genes from about 400 samples). The documentation says VQSR is not suitable for data like this and suggests either the CNN-based approach or hard filtering, adding that the CNN-based approach may be the better of the two. (I'm using GATK v4.1.4.1.) However, since I have cohort data I'm currently working with a joint-called callset, and the documentation says the CNN-based approach is still experimental in that case (although it's established for single-sample data). I have a few different questions relating to this.
One workaround might be to call each sample individually, filter with the CNN-based approach (using the default models), and then combine the separate VCFs or gVCFs (depending on whether it is still possible to do joint calling at that stage). Is there anything wrong with that approach, in either variation? (I suspect it loses some of the advantages of joint calling, maybe in both cases.)
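To make the per-sample workaround concrete, here is a minimal sketch that only builds the command lines for one sample (it does not run GATK). The file names, the helper name `per_sample_cnn_commands`, and the tranche values are placeholders and assumptions, not recommendations; the `CNN_1D` key assumes the default 1D model is used for scoring.

```python
import shlex
from typing import List


def per_sample_cnn_commands(sample: str, ref: str, intervals: str,
                            resources: List[str]) -> List[List[str]]:
    """Build (not run) the call -> score -> filter commands for one sample.

    All file names are hypothetical; tranche values are placeholders.
    """
    raw = f"{sample}.raw.vcf.gz"
    scored = f"{sample}.cnn.vcf.gz"
    filtered = f"{sample}.cnn.filtered.vcf.gz"

    # Single-sample calling restricted to the targeted intervals.
    call = ["gatk", "HaplotypeCaller", "-R", ref, "-I", f"{sample}.bam",
            "-L", intervals, "-O", raw]

    # Score with the default (1D, reference-based) model.
    score = ["gatk", "CNNScoreVariants", "-R", ref, "-V", raw, "-O", scored]

    # Filter on the CNN score using known-sites resources to set tranches.
    filt = ["gatk", "FilterVariantTranches", "-V", scored,
            "--info-key", "CNN_1D",
            "--snp-tranche", "99.95", "--indel-tranche", "99.4",
            "-O", filtered]
    for resource in resources:
        filt += ["--resource", resource]

    return [call, score, filt]


if __name__ == "__main__":
    for cmd in per_sample_cnn_commands("sample01", "ref.fasta", "targets.bed",
                                       ["hapmap.vcf.gz", "mills.vcf.gz"]):
        print(shlex.join(cmd))
```

Looping this over all samples gives one filtered VCF per sample; how those are then combined is exactly the open question above.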
I think I read somewhere (unfortunately I can't find the relevant documentation now; it may have been on the old GATK forum) that in order to use the CNN-based approach on a joint callset I would have to train my own CNN model. So I've looked into doing this, which requires running CNNVariantWriteTensors and then CNNVariantTrain; as I understand it, the first program puts the data into a format that CNNVariantTrain can work with. I can't find any documentation on how to do this apart from the individual manual pages for those two programs (which include example command lines). I guessed that the truth VCF given to CNNVariantWriteTensors might be similar to one or more of the resources used by VariantRecalibrator, but the example command line appears to use a Platinum Genomes file instead, so maybe that wouldn't be possible. My first attempts have therefore used the Platinum Genomes "hybrid truthsets" VCF and BED files (for hg19) downloaded from the Illumina website. Is this going to be suitable?
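For reference, the two-step training run described above can be sketched as follows, with arguments modelled on the example command lines in the tool manual pages. This only builds the commands; the helper name `cnn_training_commands` and all file and directory names are hypothetical, the `reference` tensor type is an assumption (a 1D model), and the sketch assumes the input VCF holds calls for the same sample that the truth VCF and confident-region BED describe.

```python
import shlex
from typing import List


def cnn_training_commands(ref: str, train_vcf: str, truth_vcf: str,
                          truth_bed: str, tensor_dir: str,
                          model_name: str) -> List[List[str]]:
    """Build (not run) the write-tensors and train commands.

    Assumption: train_vcf contains calls for the sample described by
    truth_vcf/truth_bed, and all files use the same reference build.
    """
    # Step 1: label variants against the truth set and write training tensors.
    write = ["gatk", "CNNVariantWriteTensors",
             "-R", ref, "-V", train_vcf,
             "-truth-vcf", truth_vcf, "-truth-bed", truth_bed,
             "-tensor-type", "reference",
             "-output-tensor-dir", tensor_dir]

    # Step 2: train a model from the tensors written in step 1.
    train = ["gatk", "CNNVariantTrain",
             "-tensor-type", "reference",
             "-input-tensor-dir", tensor_dir,
             "-model-name", model_name]

    return [write, train]


if __name__ == "__main__":
    for cmd in cnn_training_commands("ref.fasta", "calls.vcf.gz",
                                     "truth.vcf.gz", "confident.bed",
                                     "tensors", "my_1d_model"):
        print(shlex.join(cmd))
```

The tensor directory is the only link between the two steps, so the same path and tensor type have to be passed to both commands.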
William
-
Hi ,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if you know the answer.
For context, check out our support policy.