Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Detecting sample swaps with Picard tools

7 comments

  • Yossi Farjoun

    We finally got approval to make public the full "haplotype database" that we use in production. These are designed for human samples:

    gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.with_assay_info.haplotype_database.txt

    gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.with_assay_info.haplotype_database.txt

     

    Note that some assay information is included in these files and we are allowed to say that at the Broad we were using the sites listed under "fluidigm_96plex_v1".

     

    I hope this is useful for someone.
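
As a rough illustration of pulling one assay's sites out of such a file, here is a minimal Python sketch. It assumes the usual Picard haplotype map layout (SAM-style `@` header lines, one `#`-prefixed tab-separated column header, then one row per SNP) and, as a further assumption, that the assay names sit in a comma-separated PANELS column. The function name `sites_for_panel` and the toy rows are made up for the example and are not taken from the published files.

```python
import io


def sites_for_panel(handle, panel):
    """Yield (chrom, pos, name) for rows whose PANELS column lists `panel`."""
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        if line.startswith("#"):
            # Column header line, e.g. "#CHROMOSOME\tPOSITION\t..."
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("no '#'-prefixed column header before data rows")
        row = dict(zip(columns, line.split("\t")))
        # Assumption: PANELS holds a comma-separated list of assay names.
        panels = [p.strip() for p in row.get("PANELS", "").split(",")]
        if panel in panels:
            yield row["CHROMOSOME"], int(row["POSITION"]), row["NAME"]


# Toy input with invented SNPs, only to show the expected shape of the file.
TOY_MAP = (
    "@HD\tVN:1.5\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "#CHROMOSOME\tPOSITION\tNAME\tMAJOR_ALLELE\tMINOR_ALLELE\tMAF\tANCHOR_SNP\tPANELS\n"
    "chr1\t1000\tsnp_a\tA\tG\t0.30\t\tfluidigm_96plex_v1\n"
    "chr1\t2000\tsnp_b\tC\tT\t0.45\t\tother_panel\n"
    "chr1\t3000\tsnp_c\tG\tA\t0.25\tsnp_a\tother_panel,fluidigm_96plex_v1\n"
)

selected = list(sites_for_panel(io.StringIO(TOY_MAP), "fluidigm_96plex_v1"))
```

To try it on a real file, download one of the paths above, check that its header matches these assumptions, and pass an open file handle in place of the `io.StringIO` object.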

     

     

  • Yossi Farjoun

    We finally got the thumbs-up to publish the actual SNP list that we use for fingerprinting... you can find it here.

     

    Also, in case you were wondering what system we are using for assaying the samples, here's the description:

    "The Broad utilizes Fluidigm Corporation's BioMark HD dPCR system. This platform is based on microfluidic chip technology, requiring less sample and reagents while still achieving high-quality, consistent results. Custom primers are used throughout the process. A multiplexed pre-amplification step (STA) allows for target enrichment. Samples are diluted post-amplification and added to the Sample Loading Reagent (SLR). Assay plates are prepared and the IFC (Integrated Fluidic Circuit) Chip is primed utilizing the IFC Controller HX. Sample and Assay are both added manually into the wells on either side of the IFC. Samples are loaded into the wells on the right side of the chip; assay reagent is loaded into the wells on the left side of the chip. Once all wells are loaded, the chip is again placed into the IFC Controller for loading. Once loading is completed, the chip is placed in the FC1 thermocyclers for amplification, followed by scanning on the Biomark HD. Scoring and analysis is performed utilizing the Genotype Analysis Software from Fluidigm."

     

     

  • Beatriz Moleirinho

    Hello Dr Yossi Farjoun,

     

    I am very interested in trying your approach for detecting sample swaps in some ATAC-seq data we are analyzing in our lab. I was wondering whether you have a Docker image with the tools used, and whether you have ever tried this approach on ATAC-seq data.

     

    Thanks in advance for your time. Best regards

  • Yossi Farjoun

    Hello Beatriz Moleirinho

    Thanks for reaching out. In the paper listed above we used fingerprinting on DNase, ChIP-seq and RNA-seq, but not ATAC-seq. I don't see a reason why it shouldn't work, as long as the reads you obtain carry alleles at sites where the DNA varies. The Broad's pipelines can be found at https://github.com/broadinstitute/warp and https://dockstore.org/, and they also have Docker images to go with them. The paper points to a resource that was optimized for sparse data such as yours, https://github.com/naumanjaved/fingerprint_maps, so you can use that for the haplotype maps!

    Best of luck!

     

  • Pubudu Nawarathna

    Hello Dr Farjoun,

    Thanks for the great explanation. I was wondering whether there is a method to detect a sample swap when the samples are from the same individual but under different conditions (e.g. treatment vs. control). How can we ensure that read groups from multiplexed samples are from the same condition, given that all the reads are from the same individual?

  • Yossi Farjoun

    Hello Pubudu,

    Thanks for writing in.

    Methods for detecting different conditions would depend on the conditions and the type of data. I can conceive of a way to do this for RNA by clustering the per-gene coverage of the different read groups after dimensionality reduction with PCA or similar, but I don't have a ready-to-go script or method for doing this. Could be a nice addition to the field!
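
The idea described above can be sketched roughly as follows. This is an illustration under stated assumptions and not a tested method: the per-read-group gene counts are invented, the helper names (`log_cpm`, `pc1_scores`, `split_by_pc1`) are hypothetical, the first principal component is found by plain power iteration using only the standard library, and a split on the sign of the PC1 score stands in for a real clustering step.

```python
import math


def log_cpm(counts):
    """Scale raw per-gene counts to counts-per-million, then take log2(x + 1)."""
    total = sum(counts)
    return [math.log2(c * 1e6 / total + 1.0) for c in counts]


def pc1_scores(rows, iterations=200):
    """Return each row's score on the first principal component (up to scale).

    rows: one list per read group, one value per gene.
    Runs power iteration on the Gram matrix of the gene-centered data.
    """
    n = len(rows)
    n_genes = len(rows[0])
    # Center each gene (column) across read groups.
    means = [sum(r[j] for r in rows) / n for j in range(n_genes)]
    centered = [[r[j] - means[j] for j in range(n_genes)] for r in rows]
    # Gram matrix X X^T (n x n); its top eigenvector gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return [0.0] * n  # all read groups look identical
        vec = [x / norm for x in nxt]
    return vec


def split_by_pc1(names, count_rows):
    """Split read groups into two sets by the sign of their PC1 score."""
    scores = pc1_scores([log_cpm(r) for r in count_rows])
    positive = [n for n, s in zip(names, scores) if s > 0]
    negative = [n for n, s in zip(names, scores) if s <= 0]
    return positive, negative


# Invented counts: 4 read groups x 5 genes; genes 2 and 3 differ by condition.
NAMES = ["rg_a", "rg_b", "rg_c", "rg_d"]
COUNTS = [
    [100, 50, 200, 30, 80],
    [110, 45, 190, 35, 85],
    [100, 200, 50, 30, 80],
    [95, 210, 55, 28, 82],
]
group_1, group_2 = split_by_pc1(NAMES, COUNTS)
```

With real data one would use many more genes, inspect several components rather than only the first, and check whether the dominant axis reflects condition at all or something else such as batch or lane.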

     

     

  • Chadi Saad

    Hi Yossi,

    To use "CheckFingerprint", do we need to build the HAPLOTYPE_MAP from our own population? We have an unphased multisample VCF containing thousands of samples. Or is it OK to use the public database "Homo_sapiens_assembly38.with_assay_info.haplotype_database.txt" listed above?

    If we have to build it, is there a tool to build it from the msVCF?

    Also, if there is no concordance between the array and the WGS, is there a way to quickly scan the whole population and find the corresponding WGS for a specific array?

    Thanks
