Running best practices workflow step-by-step in command line
I'm trying to follow the GATK Best Practices for Germline SNPs Indels workflow as independent steps in the command line locally or within a GATK Docker container (i.e. not on a cloud service). I don't see the individual steps that are implemented in the workflow.
Specifically, starting with data pre-processing for variant discovery, I first converted the fastq files to unmapped bam (gatk FastqToSam) and validated the output (gatk ValidateSamFile). Then, next steps are to Map to Reference using BWA & MergeBamAlignments. Going to bwa outside of the GATK container, it seems like I need to run bwa aln -b because I'm feeding it a bam file. After that MergeBamAlignments requires an OUTPUT (result of alignment?), REFERENCE_SEQUENCE (the original fasta file file that reads were aligned to, ie. hg38), and UNMAPPED_BAM (output of FastqToSam?). But when I feed those files to the MergeBamAlignments, I get an error.
So big picture, I'd like to see the steps that the workflow is implementing and, ideally, the individual commands that it's running to get there. Is something like this available?
If not an error, choose a category for your question(REQUIRED):
a)How do I (......)?
-
Hi,
The following article may help clarify the specific required files for using MergeBamAlignments. Could you please clarify the files you specified when running MargeBamAlignments?
https://gatk.broadinstitute.org/hc/en-us/articles/360057439611-MergeBamAlignment-Picard-
Kind regards,
Pamela
-
Hi limetree, did the above answer allow you to run the full workflow in the end/ I ask as I'm about to try and do the same as you and was wondering what pitfalls I might run into!
-
Hi Naomi Dyer,
Are there any specific parts of the workflow that you are concerned about or running into issues with? I would recommend attempting to run the workflow on your data and posting a new ticket on the forum if you happen to run into any errors.
Kind regards,
Pamela
-
Thanks Pamela for getting back to me so quickly. I would say my concerns are a long story!
What I ultimately want to do is use the ASERead counter tool to analyse RNAseq data for allelic imbalance in F1 progeny of a cross. I got in touch with Stephane Castel to ask about its use, and realised that my experimental design makes it a bit more tricky, as due to the size of the samples it was not possible to extract sufficient RNA and DNA from the same individuals, to produce a suitable VCF file with specific sites to process. In addition, the RNAseq was done on pools of around 10 siblings for each of 6 crosses. Stephane suggested I could either generate a VCF based on the parents (losing many heterozygous sites in the process), or try calling genoptypes from the RNA seq data itself.
At present, I would say the main issue I am concerned with is that I have RNAseq on pooled samples (of Anopheles gambiae mosquitoes), and I'm not quite sure whether the workflow will still be suitable or needs tweaking? The data preprocessing for variant discovery mentions "this workflow is designed to work on individual samples".
thanks
Naomi
-
Hi Naomi Dyer,
There are definitely more considerations when working with RNAseq data due to the decreased amount of data and higher error rates. However, I think that you should still be able to run the workflow. In the first part of your response, I would suggest attempting to call genotypes from the RNAseq data itself so as to keep the maximum amount of data/variation. Here is the best practices document for calling variants from RNAseq data.
As for the second part of your response, I believe that you should still be able to run the workflow. Pooled data is generally best when working with RNAseq to maximize the coverage. Given that you have the 6 different crosses, you should be able to treat the set of data from each cross as an individual sample. You definitely have a unique use-case so I am not super familiar or certain on what exactly will yield the best results, but I would recommend attempting variant calling with the current above workflow on your RNAseq data.
Please let me know if you have any additional questions or if you run into issues/errors.
Kind regards,
Pamela
Please sign in to leave a comment.
5 comments