CombineGVCFs errors and questions: inconsistent references & running CombineGVCFs twice
Hi there,
Following the conversation in https://gatk.broadinstitute.org/hc/en-us/community/posts/4404389488923-GenotypeGVCFs-only-accepts-single-combinedGVCF-CombineGVCFs-genomicsdb-errors-interval-list-subset, I've used CombineGVCFs to combine the gVCFs of ~3000 samples (1000 WES & 2000 WGS) by chromosome. Since CombineGVCFs could not take ~3000 samples at once (it gave me an out-of-memory error even with 320GB allocated to Java), I first shuffled the sample list and randomly divided it into 3 batches of 1000 individual g.vcf files each. I ran CombineGVCFs on the individual g.vcf files in each batch separately, and then ran a second CombineGVCFs to combine the three batch g.vcf files together, as below.
Step 1. Run CombineGVCFs on the 1000 individual g.vcf files in each batch separately, generating batch1.g.vcf.gz, batch2.g.vcf.gz & batch3.g.vcf.gz:
gatk --java-options '-Xmx120g -XX:ParallelGCThreads=2' CombineGVCFs \
-R $ref_genome \
--variant $batch${num}.list \
-O ${chr}.batch${num}.combined.cohort.g.vcf.gz \
-G StandardAnnotation \
-G AS_StandardAnnotation \
-G StandardHCAnnotation \
--intervals $chr_interval \
--interval-padding 10
Step 2. Run CombineGVCFs on batch1.g.vcf.gz, batch2.g.vcf.gz & batch3.g.vcf.gz to generate all.combined.cohort.g.vcf.gz:
gatk --java-options '-Xmx120g -XX:ParallelGCThreads=2' CombineGVCFs \
-R $ref_genome \
--variant batch1.g.vcf.gz \
--variant batch2.g.vcf.gz \
--variant batch3.g.vcf.gz \
-O ${chr}.allsample.combined.cohort.g.vcf.gz \
-G StandardAnnotation \
-G AS_StandardAnnotation \
-G StandardHCAnnotation \
--intervals $chr_interval \
--interval-padding 10
Step 1 ran with no errors, and Step 2 worked for all chromosomes except chr7 & chr16, where the log showed "The provided variant file(s) have inconsistent references for the same position(s) at chr7:100965271, T* vs. G*; subsample1: T G,<non_ref>; subsample2: nothing; subsample3: G *,<non_ref>". Our bioinformatician suggested that I include more samples, so I included all ~3000 samples and ran CombineGVCFs all at once:
gatk --java-options '-Xmx500g -Xms500g -Xmn200g -XX:ParallelGCThreads=2' CombineGVCFs \
-R $ref_genome \
--variant $all_gvcf.list \
-O ${chr}.allsample.combined.cohort.g.vcf.gz \
-G StandardAnnotation \
-G AS_StandardAnnotation \
-G StandardHCAnnotation \
--intervals $target_union \
--interval-padding 10
It worked, although it took ~500GB of memory and 5 days to finish chr7. I checked the previously inconsistent site chr7:100965271 and it was now "chr7 100965271 . T G,*,<NON_REF>". However, I'm still confused about why running CombineGVCFs on the individual g.vcf files in batches and then running a second CombineGVCFs on the batch g.vcf files would give an inconsistent-reference error. The error didn't seem to be due to an unnatural breakpoint, because running CombineGVCFs on all ~3000 samples at once worked. I used GATK version 4.1.2.0 and downloaded the hg38 reference genome from the Broad bundle.
Because of the error when combining the batch g.vcf files, I was concerned that running CombineGVCFs on individual g.vcf files in batches and then running a second CombineGVCFs on the batch g.vcf files might generate different results than running CombineGVCFs on all ~3000 individual g.vcf files at once. I used chr15 to test. When I combined the individual g.vcf files in batches first and then ran a second CombineGVCFs on the batch g.vcf files, the output had 2,181,303 sites (zcat *.gvcf | grep -v "#" | wc -l). However, when I ran CombineGVCFs on all 3000 individual g.vcf files at once, it generated 3,367,114 sites. I then used R to check the alleles of the sites unique to the all-at-once run, and found that 99.9% of them were *,<NON_REF> or <NON_REF>; the remainder seemed to be real variant sites such as A,*,<NON_REF>, A,AT,ATTT,<NON_REF> and C,<NON_REF>. This made me worried that I might miss some real variants by running CombineGVCFs twice. Can I really run CombineGVCFs on individual g.vcf files in batches and then run a second CombineGVCFs on the batch g.vcf files? I got this idea from the GATK3 legacy forum, but I'm not sure it still holds for GATK4.
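For reference, the unique-site comparison above can be sketched as a small shell helper (the function name and file paths below are illustrative, not GATK tools; either plain or bgzipped inputs work because zcat -f decompresses only when needed):

```shell
# Sketch: list CHROM:POS keys present in combined GVCF A but absent from B.
# gvcf_unique_sites is an illustrative helper, not a GATK tool.
gvcf_unique_sites() {
  comm -23 \
    <(zcat -f "$1" | grep -v '^#' | awk '{print $1":"$2}' | sort -u) \
    <(zcat -f "$2" | grep -v '^#' | awk '{print $1":"$2}' | sort -u)
}

# Usage (paths are illustrative):
# gvcf_unique_sites chr15.allsample.combined.g.vcf.gz chr15.twostep.combined.g.vcf.gz
```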
If I cannot run CombineGVCFs on individual g.vcf files in batches and then run a second CombineGVCFs on the batch g.vcf files, I will end up with 3 batch g.vcf files per chromosome for GenotypeGVCFs. However, GenotypeGVCFs in GATK4 only accepts a SINGLE GVCF as input. How can I deal with this? Should I switch back to GATK3?
Thank you!
Mingzhou
-
Hi Mingzhou,
Thanks for writing in with all this information; it is very helpful for figuring out the best path forward. For your case, I would recommend running CombineGVCFs for all samples, but separating the runs by chromosome or interval. You can then merge the per-interval GVCFs with MergeVCFs.
The sites you found that were unique to the output of the single all-at-once CombineGVCFs run (*,<NON_REF> or <NON_REF>) are reference-block sites, which are helpful when running GenotypeGVCFs for joint calling. The method I outlined above is the one most similar to our current Best Practices: our production pipeline uses MergeVCFs after running GenomicsDBImport on separate intervals.
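In command form, that scatter/gather pattern might look roughly like this (a sketch only: the chromosome list, interval-list file names, and output names are illustrative, and the heap size should match what your jobs can actually get):

```shell
# Sketch of the recommended pattern: run CombineGVCFs once per
# chromosome/interval over ALL samples, then gather the per-interval
# outputs with MergeVcfs. File names are illustrative.
for chr in chr7 chr16; do            # one job per chromosome/interval
  gatk --java-options '-Xmx120g' CombineGVCFs \
    -R "$ref_genome" \
    --variant all_gvcf.list \
    --intervals "${chr}.interval_list" \
    -O "${chr}.combined.cohort.g.vcf.gz"
done

# Gather the per-interval GVCFs back into a single cohort file.
gatk MergeVcfs \
  -I chr7.combined.cohort.g.vcf.gz \
  -I chr16.combined.cohort.g.vcf.gz \
  -O allsample.combined.cohort.g.vcf.gz
```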
Although I am not totally clear on why CombineGVCFs gives different results when run twice, I don't think you should run it that way: it is quite different from our recommendations, and the sites you found could definitely make a difference.
Hope this helps with your analysis!
Genevieve
-
Hi Genevieve,
Thank you very much for your quick reply! I definitely want to follow the method you suggested (and did so for some chromosomes), but this method is actually NOT feasible given the per-user job and memory limits on our HPC. (I did argue with our HPC team, but they refused to lift the cap ...)
Since I should not run CombineGVCFs twice, I'm wondering if it is okay to switch back to GATK3 and run GenotypeGVCFs with the three batch g.vcf files? I'm not sure whether it's valid to feed data generated by GATK4 into GATK3.
Thank you!
Mingzhou
-
Hi Mingzhou,
I see. I generally wouldn't recommend using data from GATK4 with GATK3. I'll reach out to people on my team who might know more about this, but it might take some time to get an answer since this is outside our Best Practices.
Best,
Genevieve
-
That would be great, Genevieve! Thank you so much for the help so far!
-
Hi Genevieve,
Since the time- and memory-limiting step of CombineGVCFs/GenomicsDBImport for my ~3000 g.vcf files is that the ~2000 WGS g.vcf files are too large to read in, I'm wondering if I can use SelectVariants to trim/filter the WGS g.vcf files down to the target regions used by my WES data? In general, can I pre-filter the g.vcf files of my WGS data and keep only the sites within specified target regions?
Thank you!
-
Yes, you could do that. You can run SelectVariants with the intervals option to keep only the sites within your WES intervals.
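A minimal sketch of that pre-filtering step, run once per WGS sample (the file names and the padding value are illustrative assumptions, not prescribed settings; some padding keeps reference blocks that abut the target edges):

```shell
# Sketch: trim one WGS g.vcf down to the WES target regions before combining.
# File names and the padding value are illustrative.
gatk SelectVariants \
  -R "$ref_genome" \
  -V sample1.wgs.g.vcf.gz \
  --intervals wes_targets.interval_list \
  --interval-padding 100 \
  -O sample1.wgs.targets.g.vcf.gz
```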
-
Hi Mingzhou Fu,
I got some more feedback from our dev team to share:
- We would not recommend going back to GATK3 for GenotypeGVCFs; the annotations will not be compatible and your results will not be accurate.
- A better solution would probably be running WGS and WES separately, since you have not found a workable alternative.
- When you run your jobs with -Xmx120G, how much physical memory is available to the job? You will want quite a bit available, since 120G is already a lot.
- You might want to think about taking this research to the cloud, since 3000 samples is a lot. We also offer Terra, which is a great service with excellent support! It wouldn't necessarily be faster, but you would have more flexibility with your jobs.
Let me know if I can be of any more help.
Best,
Genevieve
-
Hi Genevieve,
Thank you so much for the reply!
If I run WGS and WES separately, how can I put them back together for GenotypeGVCFs? I want the WGS data from 1KG to be joint-called together with my samples to improve variant detection.
I requested 150G of physical memory for the -Xmx120G jobs; I found on the forum that the Java heap should be about 80%-90% of the physical memory.
Yes, I did talk about it with my PI, since I'm not the only one in the lab who got stuck ...
Thank you!
Mingzhou
-
I got this WARN when running CombineGVCFs. I guess it is probably because my WES samples were captured with different capture kits (and thus different target regions), and I used the union of the target regions (as regions of interest) in the -L option for CombineGVCFs. I just wonder whether this WARN would affect any later analysis/results if I only care about variants within my regions of interest?
WARN CombineGVCFs - You have asked for an interval that cuts in the middle of one or more gVCF blocks. Please note that this will cause you to lose records that don't end within your interval.
-
Mingzhou Fu We don't have a way for you to combine the samples as you are describing; you would have to run GenotypeGVCFs separately.
You could lose records that span the end of the interval. If these are long records and you lose the entire record, you could be potentially losing data that could be helpful.
-
Thank you Genevieve! Regarding the lost records, is there a quick way to figure out which records were lost? Should I compare the sites/positions in the input gVCFs vs. the sites/positions in the combined gVCF?
-
Yes you could investigate that way. Start positions missing from your combined GVCF will have been lost.
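Another way to look for records at risk is to check which gVCF reference blocks span the cut position of an interval; the sketch below does this with a small shell helper (the function name and the example position are illustrative, not GATK tools; the END=<pos> INFO key is the standard gVCF reference-block notation):

```shell
# Sketch: list gVCF records whose block spans a given cut position, i.e.
# records at risk of being dropped when -L cuts through them.
# find_spanning_blocks is an illustrative helper, not a GATK tool.
find_spanning_blocks() {
  local gvcf=$1 chrom=$2 cut=$3
  zcat -f "$gvcf" | awk -v c="$chrom" -v p="$cut" '
    !/^#/ && $1 == c {
      end = $2 + 0                     # default: record covers only its start
      if (match($8, /END=[0-9]+/))     # reference blocks carry END=<pos> in INFO
        end = substr($8, RSTART + 4, RLENGTH - 4) + 0
      if ($2 + 0 <= p + 0 && end > p + 0) print $1, $2, end
    }'
}

# Usage (position is illustrative):
# find_spanning_blocks sample1.g.vcf.gz chr7 100965271
```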
-
Gotcha! Thank you Genevieve! One last question regarding your earlier suggestion - "I think for your case I would recommend running CombineGVCFs for all samples, but separating the runs by chromosome or intervals. You can then merge all the GVCFs for each interval with MergeVCFs" - why do we use MergeVCFs instead of GatherVcfs? Both documents say they put together VCF files of the same samples. I just wonder what the difference between the two is?
-
They are very similar. I am not totally sure of the differences, but I believe GatherVcfs was written specifically for our WDL production pipelines. For your case I would probably stick with MergeVCFs.
-
I see. Thank you very much for the explanation :)
-
You're welcome!