Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

How do I import my data into a notebook for germline filtering?

Answered

11 comments

  • Pamela Bretscher

    Hi J LoPiccolo,

    I am going to move your post into our Community Discussions -> General Discussion topic, as the germline topic is for reporting bugs and issues with GATK.

    You can read more about our forum guidelines and the topics here: Forum Guidelines.

    Best,

    Pamela

  • Samantha (she/her)

    Hi J LoPiccolo,

    Are your VCF files stored in your workspace bucket? The Jupyter notebook VM storage is separate from your workspace bucket, so in order to analyze data from your workspace bucket in a notebook, you'll need to import the data to the notebook VM. Here's an article that goes over how to do so: Analyzing data from a workspace bucket in a notebook.

    You may also want to consider copying the files over to your persistent disk, rather than just the notebook VM storage. The persistent disk lets you keep important analysis data (installed packages, input and output files) in the event that you need to delete and re-create your cloud environment.
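
    For example, in a notebook cell you could run something along these lines (the bucket path and file name below are just placeholders, so substitute your own; you can find your bucket path on the workspace dashboard):

        # copy one VCF from the workspace bucket into the notebook's home directory
        # (shown here as /home/jupyter; adjust if your environment's home directory differs)
        !gsutil cp gs://fc-your-workspace-bucket/gvcfs/sample1.vcf.gz /home/jupyter/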

    Please let me know if you have any questions.

    Best,

    Samantha

  • J LoPiccolo

    Thank you! Will it matter where in my workspace bucket the VCFs are located (for example each one is in a different folder), or will the filtering algorithm in the notebook be able to pull them all regardless of location?

  • Samantha (she/her)

    Hi J LoPiccolo,

    I'm not sure I understand your question. You'll need to copy all of the files you want to analyze from your workspace bucket over to your notebook VM storage or persistent disk, and you can organize them however you want. It shouldn't matter what folders they are in, as long as the commands in your notebook point to the right file paths.

    If you'd like us to take a closer look at your notebook, please share your workspace with GROUP_FireCloud-Support@firecloud.org and let us know the name of your workspace.

    Best,

    Samantha

  • J LoPiccolo

    Thanks, the name of the workspace is PROACTIVE-WGS_JL correct billing acct 

    I have shared it with you. Thank you!

  • Samantha (she/her)

    Hi J LoPiccolo,

    Are you using the '2-gatk-hard-filtering-tutorial' notebook? The tutorial outlines all the steps to download your files to your notebook in the 'Set up your files' and 'Download Data to the Notebook' sections. If your VCF files are located in different folders, you'll need to gsutil cp each folder separately.
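
    For example (the folder and file names below are just placeholders), run one copy per folder, or script the loop in a %%bash cell:

        # one gsutil cp per folder; -m runs the transfers in parallel
        !gsutil -m cp gs://fc-your-workspace-bucket/sampleA/*.vcf.gz .
        !gsutil -m cp gs://fc-your-workspace-bucket/sampleB/*.vcf.gz .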

    Best,

    Samantha

  • J LoPiccolo

    Thanks, I was able to do this. One other question for hard filtering- will I need to filter each VCF (so each sample) separately? I have over 100 samples, so this seems like it will be tedious if each one needs to be filtered manually. Is there a way to merge all VCFs?

    Thank you,

    Jackie 

  • Genevieve Brandt (she/her)

    Hi J LoPiccolo,

    Our general GATK best practices recommendations for 100 samples would be to run HaplotypeCaller in GVCF mode, combine the files with GenomicsDBImport, then run VQSR for filtering. You can read more about how to do filtering here: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering

    VQSR would be the easiest option if you want to avoid manual work, since you build the filtering model from all of your samples together.
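
    To sketch the overall flow at the command line (file, interval, and reference names below are placeholders; the Terra workflows wrap these same tools for you):

        # 1. call variants per sample in GVCF mode
        gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF

        # 2. consolidate the per-sample GVCFs into a GenomicsDB workspace
        gatk GenomicsDBImport \
            --sample-name-map cohort.sample_map \
            --genomicsdb-workspace-path cohort_genomicsdb \
            -L intervals.list

        # 3. joint genotyping across the whole cohort, producing one multi-sample VCF
        gatk GenotypeGVCFs -R ref.fasta -V gendb://cohort_genomicsdb -O cohort.vcf.gz

    VQSR then runs on that single joint-called VCF rather than on each sample separately.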

    Best,

    Genevieve

  • J LoPiccolo

    Hi thanks, that is indeed what I am going to do. Would the GenomicsDBImport automatically run with the joint genotyping step (1-4) of the GATK pipeline? Right now I am running Haplotype Caller (1-2), then Generate Sample Map (1-3), then Joint Genotyping (1-4). My understanding is that VQSR is part of the Joint Genotyping, which should generate 1 VCF with all of the joint calls as output. Just wanted to make sure that's correct.

    Thank you!

  • Genevieve Brandt (she/her)

    Hi J LoPiccolo,

    I'm not sure which workspace you are running, so I can't say what happens automatically. Maybe Samantha (she/her) has insight on that?

    But for VQSR in general, yes, VQSR will output 1 VCF with the joint calls.
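
    In case it's useful, the filtering itself is a two-step process on that one VCF, something like the following (the resource files and annotations below are just illustrative, and you'd repeat with -mode INDEL for indels):

        # build the recalibration model for SNPs
        gatk VariantRecalibrator \
            -R ref.fasta -V cohort.vcf.gz \
            --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
            --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
            -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
            -mode SNP \
            -O cohort.snps.recal --tranches-file cohort.snps.tranches

        # apply the model to produce a single filtered VCF for the whole cohort
        gatk ApplyVQSR \
            -R ref.fasta -V cohort.vcf.gz \
            --recal-file cohort.snps.recal --tranches-file cohort.snps.tranches \
            --truth-sensitivity-filter-level 99.7 \
            -mode SNP \
            -O cohort.snps.recalibrated.vcf.gz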

    Best,

    Genevieve

  • Samantha (she/her)

    Hi J LoPiccolo,

    GenomicsDBImport does run in the 1-4-JointGenotyping-HG38 workflow as the ImportGVCFs task. 

    https://github.com/broadinstitute/warp/blob/20dbe7e7c602c116203523055da3399dbee9b399/tasks/broad/JointGenotypingTasks.wdl#L77
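
    In case it helps, the sample map it consumes via --sample-name-map (the file the 1-3 Generate Sample Map step produces for you) is typically just a tab-separated file with one sample per line, along these lines (names and paths below are made up):

        sample1    gs://fc-your-workspace-bucket/gvcfs/sample1.g.vcf.gz
        sample2    gs://fc-your-workspace-bucket/gvcfs/sample2.g.vcf.gz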

    Best,

    Samantha

