picard sortvcf: what does it do?
AnsweredHello Everyone,
This is a general question about the inner-working of sortVCF from a newbie. Ordering the records by position is straightforward, but there seem to be quiet a few rearrangements happening in the header and the INFO column. I could not find any other documentation aside from:
https://gatk.broadinstitute.org/hc/en-us/articles/360036858331-SortVcf-Picard-
- How is the order in the header decided?
- Which INFO fields are converted? Anything aside from the reserved sub-fields from the VCF specification?
- Is there a list of how fields are converted?
I apologise if I missed any documentation, please just point me there.
Thanks,
Thomas
-
Could you specify what rearrangements you are seeing in the header? The contig lines should be staying the same and the VCF records should match those contig lines. For the other header lines, I believe we sort them alphabetically. Let me know if that is what you are seeing.
Best,
Genevieve
-
Hello Genevieve Brandt (she/her),
thanks for your reply. Is this assumption correct:
- Contig lines are last in the header, and their order is not changed
- Other lines are sorted by:
1. First letter of type (FORMAT before INFO)
2. ID value (AD before GQ) - INFO fields in the records are sorted alphabetically
Also, do you have an overview of the reserved subfields that you are using, and how you update them if they are present?
Thanks!
-
Hello Genevieve Brandt (she/her),
one more question: I have a VCF file with these two lines:
##bcftools_viewCommand=view -R /scratch/genes_with_omim_id.bed Sample_Diag-hard-filtered.vcf.gz
##bcftools_viewCommand=view /scratch/dad.srt.vcf.gz 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X YAfter the sorting I am only left with the first line:
##bcftools_viewCommand=view -R /scratch/genes_with_omim_id.bed Sample_Diag-hard-filtered.vcf.gz
Is there a reason to prevent that?
Thanks!
-
Thanks for sharing that specific example of the deletion. I don't think it should be happening. Do you have any examples of rearrangements in the header (what you mentioned earlier)? The examples help to determine what the issue might be.
Also, take a look at this article for Variant Call Format: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format. After the contig lines there are also reference lines.
Best,
Genevieve
-
Hello Genevieve Brandt (she/her),
thanks for the document. The main issues I am encountering are with the reserved keywords in the INFO field (see p5 here). sortVCF changes some of them, see examples below. I'd argue that this should either be an error or done with a strict flag. Changing the file and the INFO values (DB=4.5 to DB;) silently can cause headaches. Also, why not changing everything? Some of the reserved fields are not touched (H2, H3).
Any insights would be great!
Thanks,
Thomas1.
Input:
##fileformat=VCFv4.3
##fileformat=VCFv4.1
##INFO=<ID=DB,Number=1,Type=Integer,Description="Database version">
##INFO=<ID=SB,Number=1,Type=String,Description="Testing what SB does">
##INFO=<ID=H3,Number=1,Type=String,Description="Testing what H3 does">
##INFO=<ID=H2,Number=1,Type=String,Description="Testing what H2 does">
##INFO=<ID=END,Number=1,Type=String,Description="Testing what END does">
2 238 . A G 5628.13 PASS DB=4.5;H3=5;H2=4;END=60;SB=0.1;DP=179;FS=0;MQ=60;QD=31.44 GT:AD:DP:GQ:PL 1/1:0,177:177:99:5656,531,0
##fileformat=VCFv4.2
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=H2,Number=1,Type=String,Description="Testing what H2 does">
##INFO=<ID=H3,Number=1,Type=String,Description="Testing what H3 does">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
2 238 . A G 5628.13 PASS DB;DP=179;END=60;FS=0;H2=4;H3=5;MQ=60;QD=31.44;SB=0.1 GT:AD:DP:GQ:PL 1/1:0,177:177:99:5656,531,0 -
Thanks Thomas, I'll look into this on my end and let you know what I find.
-
I verified with my team why you are seeing these changes. Here is a resource that the GATK follows for VCFs: The Variant Call Format (VCF) Version 4.2 Specification.
- I confirmed that yes, the header lines (other than the contig lines) will be rearranged alphabetically.
- Following the VCF Specifications, only one file format line is permitted. SortVCF outputs a VCF in version 4.2, which is why your file format line is replaced as such. (##fileformat=VCFv4.2)
- The reason the bcftools_viewCommand lines in your header are being consolidated into one is because they are incorrectly formatted with a space. Because of the space, the lines are viewed as the same in the code (bcftools_viewCommand, view). A command line with spaces needs to be properly formatted to be read correctly by GATK. You can see some examples in our VCF article: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format
Let me know if you have further questions.
Best,
Genevieve
Please sign in to leave a comment.
7 comments