Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomeStrip: CNVs in segmentally duplicated regions

  • Bob Handsaker

    Hi, Christian,

    To genotype regions with non-unique sequence on the reference with Genome STRiP (using read depth of coverage), you need to create a VCF file describing the segments you want to genotype (and how to combine them). For the 2015 paper, we generated the M2 sites based on the segdup annotations from UCSC, then prospectively genotyped these putative sites and filtered for only sites with clean CNVs. I put an example of how to generate the prospective site VCF for human_g1k_v37 here:

    ftp://ftp.broadinstitute.org/pub/svtoolkit/segdups/

    There is also a Queue pipeline in the Genome STRiP release under qscript/discovery/segdup that has more information, although I can't promise the Queue scripts are up to date.

    The tags in the VCF records (specifically DUPINTERVALS and SVTYPE) enable the genotyping against non-unique portions of the reference.

    As to the question about the kmer size for the alignability mask: ideally, if you are using longer reads, you should build a mask with a larger kmer size. We typically use k=101 for reads of 100 bp and up (k larger than that has negligible effect). In practice, the difference between k=36 and k=101 will not be too significant; you may lose a small amount of genotyping power/accuracy on paralogous regions that are highly diverged (roughly between 1% and 3% divergence).

  • Pamela Bretscher

    Bob Handsaker, thank you for your input!

  • chrl

    Thanks Bob Handsaker,

    I wasn't aware of the SegDupDiscoveryPipeline.q pipeline, but I was able to run the analysis with the human_g1k_v37 segdup files that you provided, and it seems that it worked perfectly!
    Thanks for helping out, and for a super quick response!

    Best regards

    Christian
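
The kmer-size trade-off mentioned in Bob Handsaker's reply can be illustrated with a toy calculation. The sketch below is not Genome STRiP code and makes no claim about how its alignability mask is actually computed. The function names `unique_kmer_mask` and `prob_kmer_identical` and the sequence `toy_reference` are hypothetical, the mask looks at the forward strand only, and the probability model assumes that differences between two paralogs fall independently at each base, which real segmental duplications need not satisfy.

```python
from collections import Counter


def unique_kmer_mask(reference: str, k: int) -> list[bool]:
    """Toy alignability mask: True where the k-mer starting at that
    position occurs exactly once in `reference` (forward strand only)."""
    n_kmers = len(reference) - k + 1
    if k <= 0 or n_kmers <= 0:
        return []
    counts = Counter(reference[i:i + k] for i in range(n_kmers))
    return [counts[reference[i:i + k]] == 1 for i in range(n_kmers)]


def prob_kmer_identical(divergence: float, k: int) -> float:
    """Chance that a k-mer contains no paralog-specific difference,
    assuming independent per-base differences at rate `divergence`
    (a simplifying assumption, not a property of real segdups)."""
    return (1.0 - divergence) ** k


if __name__ == "__main__":
    toy_reference = "ACGTACGA"  # hypothetical sequence with a repeated "ACG"
    for k in (3, 5):
        print(k, unique_kmer_mask(toy_reference, k))
    for divergence in (0.01, 0.03):
        for k in (36, 101):
            p = prob_kmer_identical(divergence, k)
            print(f"divergence={divergence:.0%} k={k}: P(identical k-mer)={p:.3f}")
```

Under this simplified model, at 1% divergence about 70% of 36-mers but only about 36% of 101-mers would be identical between two paralogs; at 3% divergence the figures drop to roughly 33% and 5%. A longer kmer is therefore more likely to span a distinguishing base, which is consistent with the reply's point that the choice of k matters mainly for paralogous regions in that divergence range.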
