Scatter Gather principle by chromosome on Gatk
Hi all,
On a quest to optimize gatk pipeline, I met scatter gather principle, so I did following,
pids=
for chr in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
do
gatk --java-options "-Xmx5g" HaplotypeCaller -R $hg38 -I dedup.bqsr.bam --dbsnp assembly38.dbsnp138.vcf.gz -O $chr.vcf --native-pair-hmm-threads 10 -L $chr &
pids+=" $!"
done;
Similar logic for BaseRecalibrator and ApplyBQSR
Merging is done like below,
gatk --java-options "-Xmx16g" MergeVcfs \
I=$out1/variantCaller/HaplotypeCaller/chr1.vcf \
..
I=$out1/variantCaller/HaplotypeCaller/chr22.vcf \
I=$out1/variantCaller/HaplotypeCaller/chrX.vcf \
I=$out1/variantCaller/HaplotypeCaller/chrY.vcf \
O=$out1/variantCaller/HaplotypeCaller/output_variants_16-chunk_merged.vcf.gz
However, the output after merging them is not same as when running as entire genome together.
Also for ApplyBQSR, the summation of bam file size of individual chromosome is not equal to bam file size when passed without breaking.
Questions-
1. Is this approach legitimate? if yes, what can I do to match the output.
2. If it wont match ever, but the approach is legitimate, how can I make bioinformatics team to understand it is correct approach.
3. If the above approach for scatter gather is wrong, what can I do next?
-
Thank you for your post, Sushant! I want to let you know we have received your question. We'll get back to you if we have any updates or follow up questions.
Please see our Support Policy for more details about how we prioritize responding to questions.
-
Genevieve-Brandt-she-her Hi, any update upon the question?
Please sign in to leave a comment.
2 comments