Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

picard sortvcf: what does it do?

Answered
0

7 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Thomas Juettemann,

    Could you specify what rearrangements you are seeing in the header? The contig lines should be staying the same and the VCF records should match those contig lines. For the other header lines, I believe we sort them alphabetically. Let me know if that is what you are seeing.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Thomas Juettemann

    Hello Genevieve Brandt (she/her),

    thanks for your reply. Is this assumption correct:

    1. Contig lines are last in the header, and their order is not changed
    2. Other lines are sorted by:
      1. First letter of type (FORMAT before INFO)  
      2. ID value (AD before GQ)
    3. INFO fields in the records are sorted alphabetically

    Also, do you have an overview of the reserved subfields that you are using, and how you update them if they are present?

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    Thomas Juettemann

    Hello Genevieve Brandt (she/her),

    one more question: I have a VCF file with these two lines:

    ##bcftools_viewCommand=view -R /scratch/genes_with_omim_id.bed Sample_Diag-hard-filtered.vcf.gz
    ##bcftools_viewCommand=view /scratch/dad.srt.vcf.gz 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

    After the sorting I am only left with the first line:

    ##bcftools_viewCommand=view -R /scratch/genes_with_omim_id.bed Sample_Diag-hard-filtered.vcf.gz

    Is there a reason to prevent that?

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Thomas Juettemann,

    Thanks for sharing that specific example of the deletion. I don't think it should be happening. Do you have any examples of rearrangements in the header (what you mentioned earlier)? The examples help to determine what the issue might be. 

    Also, take a look at this article for Variant Call Format: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format. After the contig lines there are also reference lines.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Thomas Juettemann

    Hello Genevieve Brandt (she/her),

    thanks for the document. The main issues I am encountering are with the reserved keywords in the INFO field (see p5 here). sortVCF changes some of them, see examples below. I'd argue that this should either be an error or done with a strict flag. Changing the file and the INFO values  (DB=4.5 to DB;) silently can cause headaches. Also, why not changing everything? Some of the reserved fields are not touched (H2, H3).

    Any insights would be great!
    Thanks,
    Thomas

    1.
    Input:
    ##fileformat=VCFv4.3
    ##fileformat=VCFv4.1
    ##INFO=<ID=DB,Number=1,Type=Integer,Description="Database version">
    ##INFO=<ID=SB,Number=1,Type=String,Description="Testing what SB does">
    ##INFO=<ID=H3,Number=1,Type=String,Description="Testing what H3 does">
    ##INFO=<ID=H2,Number=1,Type=String,Description="Testing what H2 does">
    ##INFO=<ID=END,Number=1,Type=String,Description="Testing what END does">
    2 238 . A G 5628.13 PASS DB=4.5;H3=5;H2=4;END=60;SB=0.1;DP=179;FS=0;MQ=60;QD=31.44 GT:AD:DP:GQ:PL 1/1:0,177:177:99:5656,531,0

    ##fileformat=VCFv4.2
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
    ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
    ##INFO=<ID=H2,Number=1,Type=String,Description="Testing what H2 does">
    ##INFO=<ID=H3,Number=1,Type=String,Description="Testing what H3 does">
    ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
    2 238 . A G 5628.13 PASS DB;DP=179;END=60;FS=0;H2=4;H3=5;MQ=60;QD=31.44;SB=0.1 GT:AD:DP:GQ:PL 1/1:0,177:177:99:5656,531,0

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Thomas, I'll look into this on my end and let you know what I find.

    1
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Thomas Juettemann,

    I verified with my team why you are seeing these changes. Here is a resource that the GATK follows for VCFs: The Variant Call Format (VCF) Version 4.2 Specification.

    1. I confirmed that yes, the header lines (other than the contig lines) will be rearranged alphabetically. 
    2. Following the VCF Specifications, only one file format line is permitted. SortVCF outputs a VCF in version 4.2, which is why your file format line is replaced as such. (##fileformat=VCFv4.2)
    3. The reason the bcftools_viewCommand lines in your header are being consolidated into one is because they are incorrectly formatted with a space. Because of the space, the lines are viewed as the same in the code (bcftools_viewCommand, view). A command line with spaces needs to be properly formatted to be read correctly by GATK. You can see some examples in our VCF article: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format

    Let me know if you have further questions. 

    Best,

    Genevieve

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk