How do I import my data into a notebook for germline filtering?
I have generated VCF files from prior runs of HaplotypeCaller on germline data according to GATK best practices. I am now trying to do hard filtering and VQSR. How do I use this data for filtering? Do I reference the workspace bucket directly, or do I have to copy the data into a separate folder in the workspace?
-
Hi J LoPiccolo,
I am going to move your post into our Community Discussions -> General Discussion topic, as the germline topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Pamela
-
Hi J LoPiccolo,
Are your VCF files stored in your workspace bucket? The Jupyter notebook VM storage is separate from your workspace bucket, so in order to analyze data from your workspace bucket in a notebook, you'll need to import the data to the notebook VM. Here's an article that goes over how to do so: Analyzing data from a workspace bucket in a notebook.
You may also want to consider copying the files over to your persistent disk, rather than just the notebook VM. The persistent disk allows you to keep important analysis data (installed packages, input, output data) in the event that you need to delete and re-create your cloud environment.
Please let me know if you have any questions.
Best,
Samantha
-
Thank you! Will it matter where in my workspace bucket the VCFs are located (for example each one is in a different folder), or will the filtering algorithm in the notebook be able to pull them all regardless of location?
-
Hi J LoPiccolo,
I'm not sure I understand your question. You'll need to copy all the files you want to analyze in your notebook from your workspace bucket over to your notebook VM storage or persistent disk, and you can organize the files however you want. It shouldn't matter which folders they are in, as long as the commands in your notebook point to the correct file paths.
If you'd like us to take a closer look at your notebook, please share your workspace with GROUP_FireCloud-Support@firecloud.org and let us know the name of your workspace.
Best,
Samantha
-
Thanks, the name of the workspace is PROACTIVE-WGS_JL correct billing acct
I have shared it with you. Thank you!
-
Hi J LoPiccolo,
Are you using the '2-gatk-hard-filtering-tutorial' notebook? The tutorial outlines all the steps to download your files to your notebook in the 'Set up your files' and 'Download Data to the Notebook' sections. If your VCF files are located in different folders, you'll need to run gsutil cp on each folder separately.
Best,
Samantha
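
For anyone following along, here is a minimal sketch of what "gsutil cp each folder separately" could look like when scripted from a notebook cell. The bucket name, folder names, and destination directory are hypothetical placeholders, not values from this workspace, and the commands are only built and printed here rather than executed.

```python
import shlex


def build_copy_commands(bucket, folders, dest_dir, pattern="*.vcf.gz*"):
    """Return one `gsutil -m cp` command (as an argument list) per bucket folder.

    The default pattern is meant to pick up both the VCFs and their
    .tbi index files; adjust it if your files are named differently.
    """
    commands = []
    for folder in folders:
        # Normalize leading/trailing slashes so the gs:// URL is well formed.
        source = f"gs://{bucket}/{folder.strip('/')}/{pattern}"
        commands.append(["gsutil", "-m", "cp", source, dest_dir])
    return commands


# Hypothetical bucket and folder names, for illustration only.
commands = build_copy_commands(
    bucket="example-workspace-bucket",
    folders=["sample_A/vcf", "sample_B/vcf/"],
    dest_dir="vcfs/",
)

for cmd in commands:
    # Print the shell-quoted command so it can be reviewed before running.
    print(shlex.join(cmd))
```

To actually run the copies, each list could be passed to subprocess.run(cmd, check=True) after creating the destination directory. Point dest_dir at a location on the persistent disk if the files need to survive re-creating the cloud environment.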
-
Thanks, I was able to do this. One other question for hard filtering- will I need to filter each VCF (so each sample) separately? I have over 100 samples, so this seems like it will be tedious if each one needs to be filtered manually. Is there a way to merge all VCFs?
Thank you,
Jackie
-
Hi J LoPiccolo,
Our general GATK best practices recommendations for 100 samples would be to run HaplotypeCaller in GVCF mode, combine the files with GenomicsDBImport, then run VQSR for filtering. You can read more about how to do filtering here: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering
VQSR would be the easiest option if you want to avoid manual work, since the model is built from all of your samples together rather than one sample at a time.
Best,
Genevieve
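
As a rough illustration of the combine step described above, the sketch below writes the tab-delimited sample map that GenomicsDBImport reads (one sample name and one GVCF path per line) and assembles the import and joint-genotyping commands as argument lists. The file paths, interval, reference name, and the rule of taking the sample name from the file name are all assumptions for illustration. The commands are not executed, and the VQSR steps that follow joint genotyping are not shown.

```python
import os


def write_sample_map(gvcf_paths, map_path):
    """Write a tab-delimited sample map: sample name, a tab, then the GVCF path."""
    rows = []
    for path in sorted(gvcf_paths):
        # Assumption: the sample name is the file name up to the first ".".
        sample = os.path.basename(path).split(".")[0]
        rows.append((sample, path))
    with open(map_path, "w") as handle:
        for sample, path in rows:
            handle.write(f"{sample}\t{path}\n")
    return rows


# Hypothetical per-sample GVCFs produced by HaplotypeCaller in GVCF mode.
gvcfs = ["gvcfs/sampleB.g.vcf.gz", "gvcfs/sampleA.g.vcf.gz"]
rows = write_sample_map(gvcfs, "sample_map.txt")

# Combine the per-sample GVCFs into a GenomicsDB workspace for one interval.
import_cmd = [
    "gatk", "GenomicsDBImport",
    "--sample-name-map", "sample_map.txt",
    "--genomicsdb-workspace-path", "genomicsdb_chr20",  # hypothetical path
    "-L", "chr20",                                      # hypothetical interval
]

# Joint-genotype from the workspace to get a single multi-sample VCF.
genotype_cmd = [
    "gatk", "GenotypeGVCFs",
    "-R", "reference.fasta",                            # hypothetical reference
    "-V", "gendb://genomicsdb_chr20",
    "-O", "joint_calls.vcf.gz",
]

print(" ".join(import_cmd))
print(" ".join(genotype_cmd))
```

With over 100 samples, generating the sample map this way avoids listing every GVCF by hand. The joint-called VCF from the second command is the kind of input VQSR would then take.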
-
Hi, thanks, that is indeed what I am going to do. Does GenomicsDBImport run automatically as part of the joint genotyping step (1-4) of the GATK pipeline? Right now I am running HaplotypeCaller (1-2), then Generate Sample Map (1-3), then Joint Genotyping (1-4). My understanding is that VQSR is part of Joint Genotyping, which should generate one VCF with all of the joint calls as output. I just wanted to make sure that's correct.
Thank you!
-
Hi J LoPiccolo,
I'm not sure which workspace you are running so I'm not sure what happens automatically. Maybe Samantha (she/her) has insight on that?
But for VQSR in general, yes, VQSR will output 1 VCF with the joint calls.
Best,
Genevieve
-
Hi J LoPiccolo,
GenomicsDBImport does run in the 1-4-JointGenotyping-HG38 workflow as the ImportGVCFs task.
Best,
Samantha