Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

VariantRecalibrator/ApplyVQSR SNP/INDEL separation does not work

8 comments

  • Gökalp Çelik

    Hi Sigurd Krieger

    In your case you may be observing multiallelic loci where a SNP overlaps an INDEL. To work around this, split the multiallelic sites into biallelic records (bcftools norm can do this), then select SNPs and INDELs into separate files (e.g. with gatk SelectVariants) and filter each file on its own. Once both files are filtered, you can easily combine them back into a single file.

    I hope this helps. 

    Regards. 
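
To illustrate why a multiallelic site can defeat SNP/INDEL separation, here is a minimal sketch that classifies a site from its REF and ALT columns. The function name classify_site is hypothetical, and the length-only rule is a simplifying assumption: it ignores symbolic alleles, spanning deletions (*) and MNPs, and is not how GATK or bcftools classify variants internally.

```shell
#!/bin/sh
# classify_site REF ALT[,ALT...]
# Hypothetical helper: an ALT with the same length as REF counts as a SNP,
# any other length counts as an INDEL (assumption: no symbolic alleles or MNPs).
classify_site() {
  awk -v ref="$1" -v alts="$2" 'BEGIN {
    n = split(alts, a, ",")
    for (i = 1; i <= n; i++) {
      if (length(a[i]) == length(ref)) snp = 1
      else indel = 1
    }
    type = (snp && indel) ? "MIXED" : (indel ? "INDEL" : "SNP")
    print type
  }'
}

classify_site A G      # prints SNP
classify_site A G,AT   # prints MIXED: one record carries a SNP and an insertion
classify_site AT A     # prints INDEL
```

A MIXED record fits neither category cleanly, which is the motivation for splitting it into the biallelic records A>G and A>AT before selecting by type.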

  • Sigurd Krieger

    Thank you very much for your help!

    I separated the multiallelic entries after the extraction with GenotypeGVCFs from the database using the following command:

    bcftools norm -m -any -Oz -o output_splitma.vcf.gz output.vcf.gz

    This step worked well. I then reprocessed the resulting VCF file with the steps described initially and compared the resulting SNP and INDEL VCF files with bcftools isec, with the following result:

    Out of a total of 77680 variants, 5023 were unique to the INDEL file (mostly SNPs) and 2166 were unique to the SNP file (mostly INDELs)?? The remaining variants were shared by both the SNP and the INDEL VCF file (mainly SNPs with a few INDELs).
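
A comparison like this can be cross-checked without extra tooling. The sketch below counts records unique to each file and shared by both, keyed on CHROM, POS, REF and ALT. The function names variant_keys and compare_vcfs are hypothetical, the inputs are assumed to be uncompressed VCF files, and the exact-key matching is an assumption that may differ from how bcftools isec pairs records.

```shell
#!/bin/sh
# variant_keys FILE.vcf
# Print one CHROM:POS:REF:ALT key per record, sorted and de-duplicated.
variant_keys() {
  grep -v '^#' "$1" | awk -F'\t' '{ print $1 ":" $2 ":" $4 ":" $5 }' | LC_ALL=C sort -u
}

# compare_vcfs FIRST.vcf SECOND.vcf
# Report how many keys are unique to each file and how many are shared.
compare_vcfs() {
  keys_a=$(mktemp)
  keys_b=$(mktemp)
  variant_keys "$1" > "$keys_a"
  variant_keys "$2" > "$keys_b"
  only_first=$(LC_ALL=C comm -23 "$keys_a" "$keys_b" | awk 'END { print NR }')
  only_second=$(LC_ALL=C comm -13 "$keys_a" "$keys_b" | awk 'END { print NR }')
  shared=$(LC_ALL=C comm -12 "$keys_a" "$keys_b" | awk 'END { print NR }')
  rm -f "$keys_a" "$keys_b"
  printf 'only_first=%s only_second=%s shared=%s\n' "$only_first" "$only_second" "$shared"
}
```

If SNP/INDEL separation worked, calling compare_vcfs on the SNP file and the INDEL file should report shared=0; a large shared count, as described above, indicates that type selection was not applied as intended.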

  • Gökalp Çelik

    Hi again. 

    We do not directly provide support for bcftools; however, it looks like the norm parameter -m-any is not doing what you expect it to do.

    -m, --multiallelics -|+[snps|indels|both|any]

    split multiallelic sites into biallelic records (-) or join biallelic sites into multiallelic records (+). An optional type string can follow which controls variant types which should be split or merged together: If only SNP records should be split or merged, specify snps; if both SNPs and indels should be merged separately into two records, specify both; if SNPs and indels should be merged into a single record, specify any.

    In order to split multiallelics into separate records in the VCF, you need to use

    bcftools norm -m-both

    I hope this helps. 

  • Sigurd Krieger

    Thanks for pointing this out, but the -both flag worked the same as -any, so in the end there is no difference...

  • Gökalp Çelik

    After you split your multiallelics into biallelics, can you split your VCF file into two separate files, one containing SNPs and the other containing INDELs, and check whether each file contains only SNPs or only INDELs? You then need to continue filtering on those separate files, and finally you can merge them into a single VCF containing all variants.

  • Sigurd Krieger

    I split the database-derived VCF file into SNPs and INDELs using the -v option of bcftools view. As expected, the files contained only SNPs and only INDELs respectively, so nothing was sorted into the wrong file. I guess you mean I should feed these two files into the recalibration workflow?

  • Gökalp Çelik

    Yes, exactly. Once the SNP and INDEL recalibrations and hard filtering (if that is also what you would like to do) are done, you can combine those filtered files.
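
The split-then-merge workflow discussed in this thread can be sketched roughly as follows. The helper names run, split_by_type and merge_filtered and all file names are hypothetical; the tool invocations use bcftools norm, bcftools index, gatk SelectVariants and gatk MergeVcfs, and should be checked against the versions installed locally. The per-type VariantRecalibrator/ApplyVQSR (or hard-filtering) step that sits between splitting and merging is deliberately left out, since its resources and annotations depend on the dataset.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Execute a command, or only print it when DRY_RUN=1 (useful for review).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# split_by_type INPUT.vcf.gz PREFIX
# Split multiallelic sites, then write SNPs and INDELs to separate files.
split_by_type() {
  input=$1
  prefix=$2
  run bcftools norm -m -any -Oz -o "${prefix}.split.vcf.gz" "$input" &&
  run bcftools index -t "${prefix}.split.vcf.gz" &&
  run gatk SelectVariants -V "${prefix}.split.vcf.gz" \
      --select-type-to-include SNP -O "${prefix}.snps.vcf.gz" &&
  run gatk SelectVariants -V "${prefix}.split.vcf.gz" \
      --select-type-to-include INDEL -O "${prefix}.indels.vcf.gz"
}

# merge_filtered SNPS.vcf.gz INDELS.vcf.gz OUTPUT.vcf.gz
# Combine the separately filtered files into a single VCF.
merge_filtered() {
  run gatk MergeVcfs -I "$1" -I "$2" -O "$3"
}
```

Setting DRY_RUN=1 before calling split_by_type prints the four commands instead of running them, which makes it easy to review the plan before touching real data.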

  • Sigurd Krieger

    Thank you very much for your help, this seems to work!
