Invariant sites with REF alleles longer than 1
Hello,
I have performed variant calling of a set of whole-genome resequenced samples using GATK with the following tools: 1) creating individual GVCFs with HaplotypeCaller and the flag -ERC GVCF; 2) combining the GVCFs into GenomicsDatabases with GenomicsDBImport; and 3) joint genotyping with GenotypeGVCFs and the --all-sites true flag.
My aim is to get a vcf containing high quality SNPs and invariants. However, when I looked at the invariants I found sites where the length of the REF allele is > 1. If I understand well, this corresponds to invariant blocks as encoded in the GVCFs files, but I don't understand why invariants are sometimes encoded as blocks and sometimes not. Furthermore, I have found occurences where invariant blocks seem to overlap variants, as illustrated below. I am not sure what these records mean and how should I best handle them.
One last thing, I have realised after running the pipeline that I used GATK v4.2.5.0 for HaplotypeCaller and GATK v4.4.0.0 for GenomicsDBImport and GenotypeGVCFs. Could that cause a problem?
Thanks a lot in advance,
Loïs
-
GVCF reference confidence blocks are set in GQ bins of value from 0 to 99 and if each arbitrary level is requested from HaplotypeCaller it will emit reference blocks based on which bin they belong to and will split them necessary. If you wish to have longer and less binned blocks we have ReblockGVCF tool which can merge certain bins to a single GQ bin and reduce single nucleotide blocks. Conversely, if you are interested in obtainin per nucleotide data for the invariant sites you need to use -ERC BP_RESOLUTION parameter to activate single nucleotide resolution for your data.
It is possible that certain INDELs can overlap a SNP event or another reference block in a different sample so the output is pretty much expected.
Finally we always recommend using the latest GATK and unless indicated by our documents not mixing and matching versions. Versions after 4.2 are close to each other in terms of how variants are called but there are certainly differences where we fixed an edge case or introduced a new feature along the way. So please check github release notes for any potential differences that may affect your results.
I hope this helps.
Regards.
-
Hi Gökalp,
Thank you very much for your explanations and suggestions. I have two follow up questions:
1) What is the biological meaning of an INDEL overlapping an invariant block? In the exemple above, shouldn't the invariant block starting at position 56856 be AACCA, followed by the two INDELs/SNPs ?
2) What determines whether HaplotypeCaller emits an invariant block or a single invariant position?
Thank you again,
Loïs
-
Hi again.
There is no special biological meaning but this is the way it will show up in the VCF as you requested all sites to be genotyped meaning every single basepair will be covered in the final VCF. This means samples with INDEL will have a genotype and others will show as wild type and for sites next to it those INDEL, carrying samples will show a spanning INDEL for that reference block.
It depends on the quality of base calls and mapping that covers the assembled region in the local reassembly engine.
Regards.
-
Hello Gökalp,
I'm coming back to you as I've repeated variant calling following your suggestions with -ERC BP_RESOLUTION in HaplotypeCaller and the same version of GATK throughout (version 4.4.0.0, which is the most up to data available on the cluster I'm working in).
Even with that I still find sites (invariants and variants) where the length of the REF allele is >1. See for exemple these two snipets:
What would be the best way of dealing with that?
For reference here are the commands I used (in GATK 4.4.0.0):
CRAMLIST=($(<list_cram_R225.txt))
CRAM=${CRAMLIST[${SLURM_ARRAY_TASK_ID}]}gatk HaplotypeCaller -R ${REF} \
-I ${CRAM} \
-L ScDMRwT_1489_HRSCAF_1712_chr25 \
-ERC BP_RESOLUTION \
-O $(echo ${CRAM} | cut -d'/' -f11 | cut -d'_' -f1,2)_chr25.g.vcf.gzgatk GenomicsDBImport \
--genomicsdb-workspace-path GenoDB_chr25 \
-L ScDMRwT_1489_HRSCAF_1712_chr25 \
--sample-name-map map_smp_gvcfs.txt >> gDB_chr25.log 2>&1gatk --java-options "-Xmx30g" GenotypeGVCFs -R ${REF} \
-V gendb:///lustre/home/mncn/mfjanoher/Genomics/A_Genomes/variant_calling_MR36_R225/02_test_gatk/02_GenoDB/GenoDB_chr25 \
--all-sites true \
-L ScDMRwT_1489_HRSCAF_1712_chr25 \
-O R225_all_sites_ScDMRwT_1489_HRSCAF_1712_chr25.vcf.gz >> genotype_ScDMRwT_1489_HRSCAF_1712_chr25.log 2>&1Thank you very much for your help!
Best regards
Loïs
-
Hi again.
When there are INDEL signatures present in the reads it is expected to have the GVCF to contain those marks regardless of the quality (GVCF mode drops standard call confidence to 0 to have all possible stuff inside). Once genotyped only those that have quality above 30 gets emitted but still even if you are using BP_RESOLUTION mode you will still get INDELs called by HaplotypeCaller. But since you also select --all-sites as an option those not emitted sites will be emitted as REF calls with longer than 1bp reference alleles. If you do not wish to have INDELs in your data you may need to remove them after variant calling by selecting SNPs only. Please beware that removing INDELs will result in missing variants in some samples because that is the haplotype found for those samples.
Can you also post a whole variant line for those you think to be problematic instead of screenshots? We may need to take a look at what is going on there.
I hope this helps.
-
Hello Gökalp,
Thank you very much for the explanations.
What I am trying to achieve is a VCF with SNPs and invariant sites, and I would like to retain as many invariants as possible. Here what I am wondering is what to make of these invariant sites with the REF allele longer then 1. Should I remove them altogether or can I keep the first nucleotide as the REF allele (e.g. at position 34719 in the example above, the REF allele would become T). If the second option is appropriate, is there a tool to do that?
Thanks again.
Loïs
-
I am not aware of such tools but you may be able to manually modify the VCF file using regular awk, sed etc. tools.
I hope this helps.
Regards.
Please sign in to leave a comment.
7 comments