Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Invariant sites with REF alleles longer than 1

0

7 comments

  • Avatar
    Gökalp Çelik

    Hi Loïs Rancilhac

    GVCF reference confidence blocks are set in GQ bins of value from 0 to 99 and if each arbitrary level is requested from HaplotypeCaller it will emit reference blocks based on which bin they belong to and will split them necessary. If you wish to have longer and less binned blocks we have ReblockGVCF tool which can merge certain bins to a single GQ bin and reduce single nucleotide blocks. Conversely, if you are interested in obtainin per nucleotide data for the invariant sites you need to use -ERC BP_RESOLUTION  parameter to activate single nucleotide resolution for your data.

    It is possible that certain INDELs can overlap a SNP event or another reference block in a different sample so the output is pretty much expected.

    Finally we always recommend using the latest GATK and unless indicated by our documents not mixing and matching versions. Versions after 4.2 are close to each other in terms of how variants are called but there are certainly differences where we fixed an edge case or introduced a new feature along the way. So please check github release notes for any potential differences that may affect your results. 

    I hope this helps.

    Regards.  

    0
    Comment actions Permalink
  • Avatar
    Loïs Rancilhac

    Hi Gökalp,

    Thank you very much for your explanations and suggestions. I have two follow up questions:

    1) What is the biological meaning of an INDEL overlapping an invariant block? In the exemple above, shouldn't the invariant block starting at position 56856 be AACCA, followed by the two INDELs/SNPs ?

    2) What determines whether HaplotypeCaller emits an invariant block or a single invariant position?

    Thank you again,

    Loïs

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    There is no special biological meaning but this is the way it will show up in the VCF as you requested all sites to be genotyped meaning every single basepair will be covered in the final VCF. This means samples with INDEL will have a genotype and others will show as wild type and for sites next to it those INDEL, carrying samples will show a spanning INDEL for that reference block. 

    It depends on the quality of base calls and mapping that covers the assembled region in the local reassembly engine. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Loïs Rancilhac

    Hello Gökalp,

    I'm coming back to you as I've repeated variant calling following your suggestions with -ERC BP_RESOLUTION in HaplotypeCaller and the same version of GATK throughout (version 4.4.0.0, which is the most up to data available on the cluster I'm working in).

    Even with that I still find sites (invariants and variants) where the length of the REF allele is >1. See for exemple these two snipets:

    What would be the best way of dealing with that?

    For reference here are the commands I used (in GATK 4.4.0.0):

    CRAMLIST=($(<list_cram_R225.txt))
    CRAM=${CRAMLIST[${SLURM_ARRAY_TASK_ID}]}

    gatk HaplotypeCaller -R ${REF} \
                         -I ${CRAM} \
                         -L ScDMRwT_1489_HRSCAF_1712_chr25 \
                         -ERC BP_RESOLUTION \
                         -O $(echo ${CRAM} | cut -d'/' -f11 | cut -d'_' -f1,2)_chr25.g.vcf.gz

    gatk  GenomicsDBImport \
       --genomicsdb-workspace-path GenoDB_chr25 \
       -L ScDMRwT_1489_HRSCAF_1712_chr25 \
       --sample-name-map map_smp_gvcfs.txt >> gDB_chr25.log 2>&1

    gatk --java-options "-Xmx30g" GenotypeGVCFs -R ${REF} \
            -V gendb:///lustre/home/mncn/mfjanoher/Genomics/A_Genomes/variant_calling_MR36_R225/02_test_gatk/02_GenoDB/GenoDB_chr25 \
            --all-sites true \
            -L ScDMRwT_1489_HRSCAF_1712_chr25 \
            -O R225_all_sites_ScDMRwT_1489_HRSCAF_1712_chr25.vcf.gz >> genotype_ScDMRwT_1489_HRSCAF_1712_chr25.log 2>&1

     

    Thank you very much for your help!

    Best regards

    Loïs

     

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    When there are INDEL signatures present in the reads it is expected to have the GVCF to contain those marks regardless of the quality (GVCF mode drops standard call confidence to 0 to have all possible stuff inside). Once genotyped only those that have quality above 30 gets emitted but still even if you are using BP_RESOLUTION mode you will still get INDELs called by HaplotypeCaller. But since you also select --all-sites as an option those not emitted sites will be emitted as REF calls with longer than 1bp reference alleles. If you do not wish to have INDELs in your data you may need to remove them after variant calling by selecting SNPs only. Please beware that removing INDELs will result in missing variants in some samples because that is the haplotype found for those samples. 

    Can you also post a whole variant line for those you think to be problematic instead of screenshots? We may need to take a look at what is going on there. 

     

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Loïs Rancilhac

    Hello Gökalp,

    Thank you very much for the explanations.

    What I am trying to achieve is a VCF with SNPs and invariant sites, and I would like to retain as many invariants as possible. Here what I am wondering is what to make of these invariant sites with the REF allele longer then 1. Should I remove them altogether or can I keep the first nucleotide as the REF allele (e.g. at position 34719 in the example above, the REF allele would become T). If the second option is appropriate, is there a tool to do that?

    Thanks again.

    Loïs

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    I am not aware of such tools but you may be able to manually modify the VCF file using regular awk, sed etc. tools. 

    I hope this helps. 

    Regards. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk