GenomeStrip: CNVs in segmentally duplicated regions
Hi,
Version: Genome STRiP Release 2.00.1982
I have used the GenomeStrip tools SVPreprocess and CNVDiscoveryPipeline to call CNVs from WGS data, but it seems that no calls are made for segmentally duplicated/paralogous regions by default.
Are there any options to call CNVs of paralogous regions (similar to 'CNV discovery set 2' and the 'M2' method in the GenomeStrip paper from 2015)?
Another question: I have used the Reference Genome Metadata provided for the human_g1k_v37 reference with a 'genome alignability mask that uses a k-mer size of 36bp'. Will the k-mer size of the mask affect the CNV calls when it is used with 100 bp or 150 bp paired-end WGS data mapped to human_g1k_v37?
Tagging Bob Handsaker
Best regards
Christian
-
Hi, Christian,
To genotype regions with non-unique sequence on the reference with Genome STRiP (using read depth of coverage), you need to create a VCF file describing the segments you want to genotype (and how to combine them). For the 2015 paper, we generated the M2 sites based on the segdup annotations from UCSC, then prospectively genotyped these putative sites and filtered for only sites with clean CNVs. I put an example of how to generate the prospective site VCF for human_g1k_v37 here:
ftp://ftp.broadinstitute.org/pub/svtoolkit/segdups/
There is also a Queue pipeline in the Genome STRiP release under qscript/discovery/segdup that has more information, although I can't promise the Queue scripts are up to date.
The tags in the VCF records (specifically DUPINTERVALS and SVTYPE) enable the genotyping against non-unique portions of the reference.
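As a rough illustration of the first step (collecting the paralogous segments to genotype together), here is a minimal Python sketch. It assumes a simplified, hypothetical six-column input (chrom, start, end, otherChrom, otherStart, otherEnd) extracted from a segdup annotation, and it assumes that each duplication may be listed once in each direction. It only pairs the intervals; it does not write the site VCF or the DUPINTERVALS tag, for which the example on the FTP site is the reference.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def paralog_pairs(rows):
    """Return each pair of paralogous intervals once, in first-seen order.

    `rows` is an iterable of 6-field records:
    (chrom, start, end, other_chrom, other_start, other_end).
    This column layout is an assumption made for this sketch.
    """
    seen = set()
    pairs = []
    for chrom, start, end, other_chrom, other_start, other_end in rows:
        a = Interval(chrom, int(start), int(end))
        b = Interval(other_chrom, int(other_start), int(other_end))
        # Sort the two intervals so A->B and B->A collapse to the same key.
        key = tuple(sorted((a, b)))
        if key not in seen:
            seen.add(key)
            pairs.append(key)
    return pairs


# Made-up coordinates, for illustration only.
example_rows = [
    ("1", "1000", "2000", "1", "5000", "6000"),
    ("1", "5000", "6000", "1", "1000", "2000"),  # reciprocal entry of the first row
    ("2", "300", "900", "7", "100", "700"),
]
example_pairs = paralog_pairs(example_rows)
```

Each resulting pair would then correspond to one prospective site whose segments are genotyped together.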
As to the question about the k-mer size for the alignability mask: Ideally, if you are using longer reads, you should build a mask with a larger k-mer size. We typically use k=101 for reads of 100 bp and up (a larger k has negligible effect). In practice, the difference between k=36 and k=101 will not be too significant - you may lose some small amount of genotyping power/accuracy on paralogous regions that are highly diverged (roughly between 1% and 3% divergence).
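To make the effect of k concrete, here is a toy Python sketch. It is not how Genome STRiP builds its mask; it simply marks a position as "alignable" when the k-mer starting there occurs exactly once in the sequence. The toy sequence (two near-identical copies separated by a spacer) and all names are made up for illustration.

```python
from collections import Counter


def unique_kmer_mask(seq, k):
    """mask[i] is True when the k-mer starting at i occurs exactly once in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def unique_fraction(seq, k):
    """Fraction of k-mer start positions whose k-mer is unique in seq."""
    mask = unique_kmer_mask(seq, k)
    return sum(mask) / len(mask)


# Two "paralogous" 12 bp copies that differ at a single base (index 7),
# separated by an unrelated spacer.
copy_a = "ACGTTGCAAGCT"
copy_b = "ACGTTGCTAGCT"
toy = copy_a + "GGGGCCCC" + copy_b

for k in (4, 8):
    print(k, round(unique_fraction(toy, k), 3))
```

With k=4, the k-mers that fall entirely within the shared part of the two copies are not unique; with k=8, every k-mer overlapping a copy also spans the diverged base or the flanking sequence, so the two copies can be told apart. This is the same reason a longer k recovers some positions in diverged paralogous regions.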
-
Bob Handsaker, thank you for your input!
-
Thanks Bob Handsaker,
I wasn't aware of the SegDupDiscoveryPipeline.q pipeline, but I was able to run the analysis with the human_g1k_v37 segdup files that you provided, and it seems that it worked perfectly!
Thanks for helping out, and for a super quick response!
Best regards
Christian