GenotypeGVCFs by intervals
Hello,
I have a batch of 219 samples (human exomes), on which I ran GenomicsDBImport by chromosome without a problem. I am now running GenotypeGVCFs, also by chromosome. For chr 1-15 the jobs completed correctly, but for the rest (chr 16-22, X, and Y) I am having problems with time and memory. I have tried running on smaller intervals, with more memory and time, but the jobs still don't complete.
My code:
gatk --java-options "-Xmx100g" GenotypeGVCFs -R .../ucsc.hg19.fasta -V gendb://gDB16 -L chr16:3100000-5500000 -O .../c16.vcf.gz
gatk version: gatk/4.0.10.0
Is it normal that the last chromosomes need more time and memory?
I did read the post https://gatk.broadinstitute.org/hc/en-us/community/posts/360063088471-Speeding-up-GenotypeGVCFS-GATK4 , and I understand that the GenomicsDB workspace has to be loaded completely first and only then is the -L option applied. But is there any way to optimize this step?
Any help will be appreciated!
-
Hi Anna,
We have made improvements to GenomicsDB and GenotypeGVCFs since GATK 4.0.10.0. I would recommend updating your GATK to 4.1.9.0 (our current version) to run GenotypeGVCFs. If you are running on a cluster, you can also use the new option --genomicsdb-shared-posixfs-optimizations to get the best performance.
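As a rough sketch of that suggestion, the original command with the extra option added might look like the following. The reference path, workspace name, and output name are hypothetical placeholders based on the command posted above, and the run helper is only an illustration: it prints the command, and executes it only when RUN=1 is set.

```shell
# Hypothetical placeholders: adjust to your own reference, workspace, and output.
REF="${REF:-/path/to/ucsc.hg19.fasta}"
WORKSPACE="${WORKSPACE:-gDB16}"
INTERVAL="${INTERVAL:-chr16:3100000-5500000}"
OUT="${OUT:-c16.vcf.gz}"

# Illustrative helper: print the command, and execute it only when RUN=1.
run() {
    LAST_CMD="$*"
    echo "+ $LAST_CMD"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Same call as in the question, plus the shared-filesystem option
# (assumes a GATK release recent enough to know this flag).
run gatk --java-options "-Xmx100g" GenotypeGVCFs \
    -R "$REF" \
    -V "gendb://$WORKSPACE" \
    -L "$INTERVAL" \
    --genomicsdb-shared-posixfs-optimizations \
    -O "$OUT"
```

Run it once without RUN=1 to check the printed command, then again with RUN=1 to launch the job.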
-
Dear Genevieve,
Some of the VCF files were obtained with version gatk/4.0.10.0, so when I run GenotypeGVCFs with the updated version it shows me this:
A USER ERROR has occurred: Bad input: Presence of '-RAW_MQ' annotation is detected. This GATK version expects key RAW_MQandDP with a tuple of sum of squared MQ values and total reads over variant genotypes as the value. This could indicate that the provided input was produced with an older version of GATK. Use the argument '--allow-old-rms-mapping-quality-annotation-data' to override and attempt the deprecated MQ calculation. There may be differences in how newer GATK versions calculate DP and MQ that may result in worse MQ results. Use at your own risk.
Can you please tell me if it is ok to use the recent version despite this error?
Thank you!
-
Hi Anna,
It is up to you and how you are using your data. There is a discussion at our legacy forum site that summarizes the changes to the RMSMappingQuality annotation. Ideally, we would recommend using the same GATK version for all steps of the platform, but if you want to get the best performance for GenotypeGVCFs, you will need to use a newer version.
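If you do decide to genotype old-version inputs with a newer GATK, the override named in the error message would be passed roughly as sketched below. This is only an illustration: the paths and names are hypothetical placeholders, the flag name is taken from the error text itself, and, as that message warns, MQ and DP may be computed differently. The run helper prints the command and executes it only when RUN=1 is set.

```shell
# Hypothetical placeholders: adjust to your own reference, workspace, and output.
REF="${REF:-/path/to/ucsc.hg19.fasta}"
WORKSPACE="${WORKSPACE:-gDB16}"
INTERVAL="${INTERVAL:-chr16:3100000-5500000}"
OUT="${OUT:-c16.vcf.gz}"

# Illustrative helper: print the command, and execute it only when RUN=1.
run() {
    LAST_CMD="$*"
    echo "+ $LAST_CMD"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Newer GATK reading data produced by 4.0.10.0: opt in to the deprecated
# MQ calculation, using the flag quoted in the USER ERROR message.
run gatk --java-options "-Xmx100g" GenotypeGVCFs \
    -R "$REF" \
    -V "gendb://$WORKSPACE" \
    -L "$INTERVAL" \
    --allow-old-rms-mapping-quality-annotation-data \
    -O "$OUT"
```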
-
Hi Genevieve,
ok, I will have a look and try to decide what is best at this time.
Still, I don't understand why I didn't have the same problem for chr 1-15, which ran smoothly. Do you know why? Can you please explain it to me?
Again, thank you!
-
Hi Anna,
You said some of the files were created with different versions of GATK. Do you know which version was used for chr 1-15?
Genevieve
-
Hi Genevieve,
I'm sorry. All vcf files were created with gatk/4.0.10.0. I could eventually do the calling with the updated version, but only for some of the samples.
The batch is the same for chr1-15 and chr16-22.
Anna
-
Did you use the newer version of GATK with the chr1-15? If not, you would not have seen this error:
A USER ERROR has occurred: Bad input: Presence of '-RAW_MQ' annotation is detected. This GATK version expects key RAW_MQandDP with a tuple of sum of squared MQ values and total reads over variant genotypes as the value. This could indicate that the provided input was produced with an older version of GATK. Use the argument '--allow-old-rms-mapping-quality-annotation-data' to override and attempt the deprecated MQ calculation. There may be differences in how newer GATK versions calculate DP and MQ that may result in worse MQ results. Use at your own risk.
-
I used gatk/4.0.10.0 for all the steps up to the genotyping. The genotyping went well for chr 1-15, but for the rest it was taking too long (even when I tried using smaller intervals). So I tried your suggestion of using the latest version for the genotyping of chr 16-22, and that is when I got that error.
I would prefer doing everything with the same version, as you also recommended, but I cannot understand why I am seeing such differences when my scripts are the same (I just change the -L option for each chromosome).
-
Hi Anna,
I wouldn't expect a difference in time and memory between chromosomes like the one you are seeing. I wonder if there is an issue with the space available in the location that GenotypeGVCFs is using as temporary space. If you re-run one of chromosomes 1-15 now with the same command, does it run easily? You can use the option --tmp-dir with 4.0.10.0 (Tool Docs page) to specify a temporary space with enough room.
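One way to check this, sketched under the assumption of a POSIX shell on the cluster node: report the free space in the candidate temporary directory with df, then pass that directory via --tmp-dir. TMP_DIR and the other names are hypothetical placeholders, and the run helper only prints the command unless RUN=1 is set.

```shell
# Hypothetical placeholders: adjust to your own paths.
TMP_DIR="${TMP_DIR:-${TMPDIR:-/tmp}}"
REF="${REF:-/path/to/ucsc.hg19.fasta}"
WORKSPACE="${WORKSPACE:-gDB16}"
INTERVAL="${INTERVAL:-chr16:3100000-5500000}"
OUT="${OUT:-c16.vcf.gz}"

# Available space, in KB, on the filesystem holding TMP_DIR
# (df -P gives one header line, then one data line; column 4 is "Available").
avail_kb=$(df -Pk "$TMP_DIR" | awk 'NR==2 {print $4}')
echo "Free space in $TMP_DIR: ${avail_kb} KB"

# Illustrative helper: print the command, and execute it only when RUN=1.
run() {
    LAST_CMD="$*"
    echo "+ $LAST_CMD"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Point GenotypeGVCFs at a temporary directory known to have enough room.
run gatk --java-options "-Xmx100g" GenotypeGVCFs \
    -R "$REF" \
    -V "gendb://$WORKSPACE" \
    -L "$INTERVAL" \
    --tmp-dir "$TMP_DIR" \
    -O "$OUT"
```

If the reported free space is small compared with what the job writes, set TMP_DIR to a scratch location with more room before re-running.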
Please note, the GATK Team is out of office and resolving this issue may take longer than normal.
Best,
Genevieve
-
Ok, I will try that.
Thank you so much for your help!
-
No problem, hope this solves your issue!