Splitting a final filtered VCF file based on sample list and MAF Impact
Hi all, I am new to bioinformatics and genomics analyses, so please forgive me if this question is a bit basic.
I was given the task of running short-read WGS data from E. coli samples through a variant-calling pipeline based closely on GATK Best Practices, using GATK 4.2.5.0. I have completed variant calling, followed by variant selection and filtration. However, I was wondering whether it is possible to split the final VCF file into separate files based on two sets of samples.
From reading the bcftools documentation, I know this is possible with options such as:
bcftools view -S <sample_list.txt> file.vcf > filtered.vcf
... or possibly through the "-sn" flag of GATK's SelectVariants:
gatk SelectVariants -V input.vcf -R reference.fasta -sn <sample_list.args> -O sample.vcf
... both of which I was hoping to read into further and implement. However, my supervisor warned that splitting the file by sample may invalidate the allele frequencies that were calculated when the file was first created, which could affect the results.
For further context, I am to perform a downstream genome-wide association study on this data. The samples fall into two primary categories, and I want to investigate associations between variants and these categories. My understanding is that the allele frequencies are calculated when the final VCF file is created, so would it be better to split the samples not after the final step but after the individual g.vcf files are created? I could then run the rest of my pipeline (consolidating the gVCFs, selecting variants and filtering) in parallel for the two sample cohorts.
Apologies if this request is basic. I may lack some understanding of how variants are determined across the several steps, and this may indeed be a single-line fix. Any further information on my query would be greatly appreciated, and any insight into these calculations would be welcome.
Thanks very much in advance.
-
Hi Conor Sexton
The AF field in each VCF record is calculated from the number of alternate alleles (AC) and the total number of called alleles (AN), which in turn depends on the number of samples and their ploidy. Some tools recalculate the AF field correctly when a multi-sample VCF is subset by sample, but others do not, so you need to check this for whichever tool you use.
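To make the dependence on the sample set concrete, here is a minimal sketch (not part of GATK or bcftools) that recomputes AC, AN and AF from the GT fields of a chosen subset of samples. The toy VCF, the sample names S1 to S4 and the helper names make_toy_vcf and subset_af are all made up for illustration. It assumes GT is the first FORMAT key and counts every non-reference allele together, so it is only meaningful for biallelic sites.

```shell
#!/bin/sh
# Hypothetical toy VCF: two biallelic sites, four haploid samples (S1..S4).
make_toy_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n'
    printf 'chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.5\tGT\t1\t1\t0\t0\n'
    printf 'chr1\t200\t.\tC\tT\t.\tPASS\tAF=0.25\tGT\t0\t0\t1\t0\n'
}

# subset_af SAMPLES: read a VCF on stdin and print AC, AN and AF per site,
# counting only the comma-separated SAMPLES. Works for any ploidy because
# it counts the individual alleles found in each GT field.
subset_af() {
    awk -F '\t' -v keep="$1" '
        BEGIN {
            n = split(keep, names, ",")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        /^##/ { next }
        /^#CHROM/ {
            # Remember which columns belong to the requested samples.
            for (i = 10; i <= NF; i++) if ($i in want) cols[i] = 1
            next
        }
        {
            ac = 0; an = 0
            for (i in cols) {
                split($i, fmt, ":")                  # GT is assumed to be first
                m = split(fmt[1], alleles, "[/|]")   # phased or unphased
                for (j = 1; j <= m; j++) {
                    if (alleles[j] == ".") continue  # missing call: not in AN
                    an++
                    if (alleles[j] != "0") ac++      # any ALT allele
                }
            }
            af = (an > 0) ? ac / an : 0
            printf "%s\t%s\tAC=%d\tAN=%d\tAF=%.3f\n", $1, $2, ac, an, af
        }'
}

make_toy_vcf | subset_af "S1,S2"
make_toy_vcf | subset_af "S3,S4"
```

In the toy data the first site has AF=0.5 across all four samples, but the sketch reports AF=1.000 for S1,S2 and AF=0.000 for S3,S4. For real data, an existing tool such as the bcftools fill-tags plugin (bcftools +fill-tags) is a safer way to recompute these tags than hand-written awk.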
If you wish to preserve the original AF values based on all the samples, you can use
gatk VariantAnnotator
or
bcftools annotate
to reannotate your subset VCF file, using the original full VCF file as the annotation source. This lets you add the original AF values as an INFO field in your new VCF file, if this is what you really wish to do.
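As a rough sketch of the bcftools route, the commands below copy the cohort-wide AF from the full VCF into the subset VCF under a new INFO tag. The file names, the tag name AF_ALL and the helper name carry_over_af are placeholders invented for this example. It assumes both files are bgzip-compressed, that the full VCF carries INFO/AF, and that your bcftools version supports the DST:=SRC renaming syntax in the --columns option, so please check the bcftools annotate documentation for your version before relying on it.

```shell
#!/bin/sh
# Placeholder file names (assumptions, not from the original pipeline).
ALL_VCF="all_samples.vcf.gz"              # full multi-sample VCF with INFO/AF
SUBSET_VCF="cohort_A.vcf.gz"              # VCF already subset by sample
OUT_VCF="cohort_A.with_AF_ALL.vcf.gz"     # subset VCF plus INFO/AF_ALL

# carry_over_af FULL SUBSET OUT
# Index both inputs, then copy INFO/AF from FULL into SUBSET as INFO/AF_ALL.
# Records are matched by position and alleles; the AF already present in
# SUBSET is left untouched.
carry_over_af() {
    bcftools index -f -t "$1" &&
    bcftools index -f -t "$2" &&
    bcftools annotate -a "$1" -c 'INFO/AF_ALL:=INFO/AF' -Oz -o "$3" "$2"
}

# Only run when bcftools and both input files are actually available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$ALL_VCF" ] && [ -f "$SUBSET_VCF" ]; then
    carry_over_af "$ALL_VCF" "$SUBSET_VCF" "$OUT_VCF"
else
    echo "skipped: bcftools or input files not found"
fi
```

Writing the original value to a separate tag keeps both numbers visible: AF_ALL reflects all samples, while AF can be recalculated for the subset without losing the cohort-wide figure.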
Regards.
-
Hi Gökalp Çelik,
Thank you very much for your quick feedback. It took a while but I tried both methods and they worked as intended. I looked into your suggestion for annotations and this worked as I hoped.
Thanks very much again for your advice!
Regards.