VariantRecalibrator/ApplyVQSR SNP/INDEL separation does not work
Dear GATK Team, I am using GATK 4.4.0.0. I went through the VariantRecalibrator process on a small amount of patients WES data and thereafter ApplyVQSR.
The command for SNP was:
CONCAT=""
for j in {1..22} X Y M
do
CONCAT="${CONCAT}--L chr$j "
done
/mnt/f/VMWare/Shared/gatk-4.4.0.0/gatk VariantRecalibrator \
-R grch38_1kgmaj_snvs.fa \
-V output.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_146.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
-mode SNP \
-O output.snp.recal \
$CONCAT \
--tranches-file output.snp.tranches \
--rscript-file output.snp.plots.R
then applied
/mnt/f/VMWare/Shared/gatk-4.4.0.0/gatk ApplyVQSR \
-R grch38_1kgmaj_snvs.fa \
-V output.vcf.gz \
-O output_snp.vcf.gz \
--truth-sensitivity-filter-level 99.5 \
--tranches-file output.snp.tranches \
--recal-file output.snp.recal \
-mode SNP
for INDEL:
/mnt/f/VMWare/Shared/gatk-4.4.0.0/gatk VariantRecalibrator \
-R grch38_1kgmaj_snvs.fa \
-V output.vcf.gz \
--max-gaussians 4 \
--resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_146.hg38.vcf.gz \
-an QD -an ReadPosRankSum -an FS -an SOR -an MQRankSum -an DP \
-mode INDEL \
-O output.indel.recal \
$CONCAT \
--tranches-file output.indel.tranches \
--rscript-file output.indel.plots.R
and applied:
/mnt/f/VMWare/Shared/gatk-4.4.0.0/gatk ApplyVQSR \
-R grch38_1kgmaj_snvs.fa \
-V output.vcf.gz \
-O output_indel.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file output.indel.tranches \
--recal-file output.indel.recal \
-mode INDEL
Thank you,
Sigurd
-
In your case it is possible that you might be observing multiallelic loci with SNP and an overlapping INDEL. To overcome this issue you need to split multiallelics to biallelics (you may use bcftools norm for this task) and later on select SNPs and INDELs into separate files (use e.g. gatk SelectVariants) to filter. Once you are done filtering your SNPs and INDELs you can combine them into a single file with ease.
I hope this helps.
Regards.
-
Thank you very much for your help!
I separated the multiallelic entries after the extraction with GenotypeGVCFs from the database using the following command:
bcftools norm -m -any -Oz -o output_splitma.vcf.gz output.vcf.gz
this step worked well, then I reprocessed the resulting vcf file with the initially described steps. I compared the resulting SNP and INDEL vcf files with bcftools isec with the following result:
Out of a total of 77680 variants 5023 were unique to the INDEL file (mostly SNPs) and 2166 were unique to the SNP file (mostly INDELs)?? The remaining variants were shared by both, the SNP and the INDEL vcf file (mainly SNPs with a few INDELs).
-
Hi again.
We do not directly provide any support for bcftools however looks like the norm parameter -m-any is not doing what you expect it to do.
-m, --multiallelics -|+[snps|indels|both|any]
split multiallelic sites into biallelic records (-) or join biallelic sites into multiallelic records (+). An optional type string can follow which controls variant types which should be split or merged together: If only SNP records should be split or merged, specify snps; if both SNPs and indels should be merged separately into two records, specify both; if SNPs and indels should be merged into a single record, specify any.In order to split multiallelics into separate records in VCF you need to use
bcftools norm -m-both
I hope this helps.
-
Thanks for pointing this out, but the -both flag worked the same as -any, so in the end there is no difference...
-
After you split your multiallelics to biallelics can you split your VCF file into 2 separate files one containing SNPs and another containing INDELs to see if your separate files still contain SNPs or INDELs ? You need to continue filtering on those separate files and finally you can merge them into a single VCF to contain all variants.
-
I split the database derived vcf file to SNPs & INDELs using the -v option of the bcftools view call, as expected the files contained only SNPs & INDELs, so there was no wrong sorting. I guess you mean I should inject this two files into the recalibration workflow?
-
Yes exactly. Once SNP and INDEL recalibrations and hard filtering (if that is also what you would like to do) done you can combine those filtered files.
-
Thank you very much for you help, this seems to work !
Please sign in to leave a comment.
8 comments