Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Sortvcf generated empty file. GATK4

13 comments

  • danilovkiri

    Hi.

    It seems like the VCF you are trying to sort is malformed (specifically some of the allele annotations where a float is expected with a dot as a decimal separator). Could you please provide a header and some of the VCF body lines here?

    By the way, the VCFtools `vcf-sort -c` script is much more convenient (http://vcftools.sourceforge.net/perl_module.html). Basically, it does a simple sort like `sort -k1,1V -k2,2n` on the VCF body and leaves the header intact. You can also do this manually if you separate the header first (`bcftools view -h` or `grep "^#"`). This approach does not check the integrity of the VCF file, so it is best to validate the file before doing anything else.
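
The manual approach described above can be sketched roughly as follows. This is a minimal sketch, not the VCFtools implementation: it assumes an uncompressed VCF and GNU sort (needed for `-V`), and the helper name `sort_vcf` and the example records are made up for illustration. Version-sort order of contig names may differ from the order in a reference sequence dictionary.

```shell
# Sketch: sort a VCF by hand while keeping the header lines on top.
# Assumes GNU sort, because -V (version sort) is a GNU extension.
# sort_vcf is a hypothetical helper name, not part of any toolkit.
sort_vcf() {
    # Header lines (starting with '#') pass through unchanged.
    grep '^#' "$1"
    # Body: sort by contig name (version order), then by position (numeric).
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1V -k2,2n
}

# Made-up example records, only to show the effect.
example_vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$example_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$example_vcf"
printf 'chr2\t50\t.\tA\tG\t30\tPASS\t.\n' >> "$example_vcf"
printf 'chr10\t5\t.\tC\tT\t30\tPASS\t.\n' >> "$example_vcf"
printf 'chr2\t7\t.\tG\tA\t30\tPASS\t.\n' >> "$example_vcf"
sort_vcf "$example_vcf"
rm -f "$example_vcf"
```

With these example records, the two header lines come first, followed by the chr2 records at positions 7 and 50 and then the chr10 record.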

  • Cecilia Kardum Hjort

    Hi, danilovkiri

    Okay, that's strange. I'm new to this, so I'm sorry for not knowing how to provide you with the header and some of the body lines in a proper way.

     

  • danilovkiri

    Cecilia Kardum Hjort

    I don't see anything malformed here, or at least I can't find it. Could you please run `grep -v "^#" | grep "3927,07"` on the file, since the string 3927,07 (with a comma) causes the exception? Copy the output as text and paste it here. If the output is too large (I hope it won't be), you can use GATK ValidateVariants to check the VCF file for malformed records and paste its output here. Check the documentation for GATK ValidateVariants before use (some tips are also here: https://gatk.broadinstitute.org/hc/en-us/articles/360037057272-ValidateVariants).

  • Cecilia Kardum Hjort

    Where do I add the name of the file in the command line?

  • danilovkiri

    If your file is gzipped (has a suffix .gz at the very end of the file name) use the following:

    gunzip -c -d FILENAME | grep -v "^#" | grep "3927,07"
    Or, to speed things up if you have bgzip installed:
    bgzip -@ N_THREADS -c -d FILENAME | grep -v "^#" | grep "3927,07"

    For GATK ValidateVariants refer to the documentation via the link above. I hope you are using Linux/Ubuntu.
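
If grepping for one exact string finds nothing, a broader check is to list every record whose QUAL value does not look like a dot-decimal number. The sketch below assumes the problem sits in the QUAL column (field 6), which is only one of the places a malformed float could hide; the helper name `find_bad_qual` and the example records are hypothetical.

```shell
# Sketch: list VCF records whose QUAL column (field 6) is neither "." nor
# a number written with a dot as the decimal separator.
# find_bad_qual is a hypothetical helper name; it reads a VCF on stdin.
find_bad_qual() {
    awk -F '\t' '
        /^#/ { next }    # skip header lines
        $6 != "." && $6 !~ /^[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$/ {
            print $1 ":" $2 "\tQUAL=" $6    # report CHR:POS and the value
        }
    '
}

# Made-up example: only the first record has a comma in QUAL.
printf 'chr1\t10\t.\tA\tG\t3927,07\tPASS\t.\nchr1\t20\t.\tA\tG\t50.5\tPASS\t.\n' |
    find_bad_qual
```

For a gzipped file, the same function can be fed from `gunzip -c FILENAME | find_bad_qual`.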

     

    I suggest you study the command line syntax for essential commands like grep, cat, head, tail, awk. Also, try BCFtools, it is really convenient and fast in most cases.

  • danilovkiri

    I guess your other post (https://gatk.broadinstitute.org/hc/en-us/community/posts/360065950491-Error-message-VariantFiltration-GATK-4-1-4-1) has the same problem, judging by the error log. Did you manage to solve the issue reported in that post?

  • Cecilia Kardum Hjort

    The command gave me nothing (no error; it ran for a bit but produced no output):

    gunzip -c -d FILENAME | grep -v "^#" | grep "3927,07"

  • danilovkiri

    Please use GATK ValidateVariants then and check its error log. It should mention the CHR:POS identifier of any malformed entry. If one exists, copy that entry and paste it here.

  • danilovkiri

    Cecilia Kardum Hjort

    I got it. The QUAL values for many of the VCF entries in your file are floats, not integers. Although the VCF format (at least since VCFv4.1) allows float values in the QUAL field, some tools do not support them, and that is exactly what is causing the error. I suggest you regenerate your VCF with integer QUAL scores (if that is available; look into the documentation of the tools you used to create the VCFs) or write a script that converts the floats to integers. If you cannot write such a script yourself, I can help you with that. Just ask.
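
A conversion script of the kind suggested above could look roughly like this. It is a minimal sketch under stated assumptions: the VCF is uncompressed and tab-separated, QUAL is field 6 and non-negative, and rounding to the nearest integer is acceptable (it discards precision). The helper name `round_qual` and the example records are hypothetical.

```shell
# Sketch: round float QUAL values (field 6) to the nearest integer.
# round_qual is a hypothetical helper name; it reads a VCF on stdin.
# LC_ALL=C makes awk parse and print numbers with a dot, not a comma.
round_qual() {
    LC_ALL=C awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }                 # header lines pass through
        $6 != "." { $6 = int($6 + 0.5) }     # round; leave missing QUAL as "."
        { print }
    '
}

# Made-up example records, only to show the effect.
printf 'chr1\t10\t.\tA\tG\t50.5\tPASS\t.\nchr1\t20\t.\tA\tG\t.\tPASS\t.\n' |
    round_qual
```

In this example the first record ends up with QUAL 51 and the second keeps its missing value.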

  • Cecilia Kardum Hjort

    danilovkiri, I sorted out the problem! I used GATK on the command line instead of Picard, and that seemed to work.

    So this command:

    java -jar $GATK_HOME/gatk-package-4.1.4.1-local.jar MergeVcfs

    instead of this:

    java -Xmx32G -jar $PICARD_HOME/picard.jar MergeVcfs

  • danilovkiri

    Cecilia Kardum Hjort

    Please look at your original question in this thread and describe how you solved the problem. You were talking about SortVcf, not MergeVcfs. Please elaborate so that other users can benefit from your experience.

  • Cecilia Kardum Hjort

    I apologize, I copied the wrong scripts.

    This was the initial command line that I tried to run and got an error message:

    java -Xmx32G -jar $PICARD_HOME/picard.jar SortVcf \
    I=SelectedINDELS_final.vcf.gz \
    I=SelectedSNPS_final.vcf.gz \
    O=sorted_combined.vcf.gz

    Instead, I sorted the two VCF files separately, switched from Picard to GATK, and added the genome sequence dictionary:

    java -jar $GATK_HOME/gatk-package-4.1.4.1-local.jar SortVcf \
          -I SelectedINDELS_final.vcf.gz \
          -O sorted_indels.vcf.gz \
          -SD /proj/snic2020-16-43/ref/Bombus_terrestris.Bter_1.0.dna.toplevel.dict

  • Bhanu Gandham

    Thank you danilovkiri!
