Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Large number of variants called using haplotypecaller

0

10 comments

  • 0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi Leila farajzadeh

    The most important take-away message here is to read the documentation carefully. In a nutshell, the ERC modes GVCF and BP_RESOLUTION produce GVCF files which contain genotyping information for ALL sites found both in the reference genome and in the BAM, i.e. the GVCF produced under ERC BP_RESOLUTION mode will contain one entry per one nucleotide in the reference genome. That being said, the output VCF will contain more than 3 billion entries for the entire human genome. You are wrong calling these entries as "variants", GVCF entry is not a variant unless was genotyped by GenotypeGVCFs and a variant allele was called. As for ERC GVCF mode, it produces separate entries for genome sites which do contain variation (which is still to be processed by GenotypeGVCFs). However, it produces a single block for all adjacent apparently homozygous reference entries. That is exactly the reason for a large number of entries in the produced VCF. This behaviour is OK and is not to be questioned. In case you do not specify the ERC mode, HC produces VCF (already genotyped — ALT field is filled with exact ALT alleles found) containing ONLY variant entries, it does not contain entries for homozygous reference genotypes where variation was not found by HC.

    Using either of ERC modes might be of help when you do need to get not only variant genotypes but also homozygous reference genotypes and uncalled genotypes (due to various reasons, primarily for low coverage).

    Either of these options (ERC BP_RESOLUTION or ERC GVCF) is recommended to be used for further separate single-sample GVCFs merge. After GenotypeGVCFs step, you can filter out homozygous reference genotypes (if homozygosity happens across ALL samples) if you do not need them.

    The reason ERC mode is necessary is as follows. Imagine you genotype two BAM files (two biologically different samples) without using ERC mode. In that case, for sample_A and sample_B you get the following entries (artificial example):

    # for sample_A
    chr7 1511133 . C G 364.77 . <some_info> GT:AD:DP:GQ:PL 0/1:11,12:23:99:393,0,319
    chr7 1522937 . G C 149.77 . <some_info> GT:AD:DP:GQ:PL 0/1:17,7:24:99:178,0,445
    # for sample_B
    chr7 1522937 . G C 149.77 . <some_info> GT:AD:DP:GQ:PL 0/1:17,7:24:99:178,0,445

    If you combine these two VCFs using e.g. BCFtools merge, the VCF you get will look as below:

    #CHR    POS     ID      REF     ALT     QUAL    PASS    INFO            FORMAT  sample_A  sample_B
    chr7 1511133 . C G 364.77 . <some_info> GT 0/1 ./.
    chr7 1522937 . G C 149.77 . <some_info> GT 0/1   0/1

    Note that the chr7:1511133 genotype for sample_B is uncalled (but it might be just homozygous reference, you'll never know if you do not use ERC).

    On the contrary, if you use ERC mode, the chr7:1511133 GT for sample_B might turn into 0/0 (if it has enough coverage and passes all the filters). This is crucial for various types of analysis of large cohorts and now you see why. ./. (uncalled) and 0/0 genotypes are completely different.

    0
    Comment actions Permalink
  • Avatar
    Leila farajzadeh

    Hi Dear danilovkiri

    Thank you very much for the very nice reply. It was very comprehensive. Now I know the reason of using ERC mode as well as getting high number of called entries, NOT variant. It really helped to understand what is going on.

    Can I use the below scripts in order to filter for homozygous reference entries?

    echo "VariantFiltration"
    $gatk --java-options "-Xmx96g -Xms96g" VariantFiltration \
    -V $input/final_all.vcf \
    -O $input/final_all_ishomref.vcf \
    --genotype-filter-expression "isHomRef == 1" \
    --genotype-filter-name "isHomFilter" \
    2> $input/final_all_ishomref.log
    date

     

    echo "SelectVariants "
    $gatk --java-options "-Xmx96g -Xms96g" SelectVariants \
    -V $input/final_all_ishomref.vcf \
    --set-filtered-gt-to-nocall \
    -O $input/final_all_ishomref_removed.vcf \
    2> $input/final_all_ishomref_removing.log
    date

    If not, what is your recommendation? You have been very helpful. Many thanks.

    Best regards

    Leila

    0
    Comment actions Permalink
  • Avatar
    Leila farajzadeh

    Hi Bhanu Gandham

    Thanks for the recommendation, I have read them now, by the very useful reply from danilovkiri, I think I could find out what is the problem, that even after GenotypeGVF I get so many entries. If you have some Idea about filtration homozygous reference genotypes I will be vary happy to have your advice.

     

    Best Regard

    Leila

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi Leila farajzadeh

    Unfortunately, I do not have any experience with VariantFiltration and SelectVariants tools. I do recommend to use conventional tools for such tasks like BCFtools (http://samtools.github.io/bcftools/bcftools.html). It has a vast functionality which is perfectly described in the documentation referenced above.

    To filter VCF entries (exclude OR/AND include) you have to use the main command `bcftools view` specifying additional features like --include OR/AND --exclude with expressions. The example below excludes entries which have ALT allele present `GT="alt"` and do not have a PASS statement in the FILTER column. It also excludes entries with ALT alleles present and a zero GQ value from FORMAT filed. Another exclusion criterion is a reference homozygous genotype `GT="ref"` with FORMAT filed RGQ value not exceeding 15. FORMAT/DP<=20 filter genotypes by the DP value, GT~"\." — uncalled genotypes. This tool is convenient as you can configure any expression you want. Have a look at the documentation and at the example below for syntax reference.

    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) | (FORMAT/DP<=20) | (GT~"\.") | (QUAL<=50)' \
    --output-type z --no-update /home/ubuntu/Analysis/barcode/vqsr_output/barcode.recalibrated.vcf.gz --output-file \
    /home/ubuntu/Analysis/barcode/vcf_filtering_output/barcode.prefiltered.vcf.gz
    0
    Comment actions Permalink
  • Avatar
    Leila farajzadeh

    Dear danilovkiri

    once again thank you very much for your reply. It was very clear and understandable. I will try the the bcftools to filter the VCF file from GATK and let you know the results.

     

    Best Regards

    Leila

    0
    Comment actions Permalink
  • Avatar
    Leila farajzadeh

    Dear danilovkiri

    First of all, I would like to say a big thank you to you as I have learned very important points from you. 

    I have used the bcftools in order to filter my data, but this time I get very few variants, for example from  7,604,296 entries from my  "GenotypeGVCFs" using the below command  only 1896 remains.

    I have tried different combinations of the command and the number of variants are written above each:

    1896
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) | (FORMAT/DP<=20) | (GT~"\.") | (QUAL<=50)' \
    --no-update $input/final_all.vcf --output-file \
    $output/final_filtered_dan_bcf.vcf

     

    7,604,290
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) ' \
    --no-update $input/final_all.vcf --output-file $output/final_filtered_dan_bcf_v2.vcf

     


    156,082
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) | (GT~"\.")' \
    --no-update $input/final_all.vcf --output-file $output/final_filtered_dan_bcf_v3.vcf

     

    1896
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) | (FORMAT/DP<=20) | (QUAL<=50)' \
    --no-update $input/final_all.vcf --output-file $output/final_filtered_dan_bcf_v4.vcf


    156,082
    bcftools view --threads 11 --exclude \
    '(GT~"\.")' \
    --no-update $input/final_all.vcf --output-file $output/final_filtered_dan_bcf_v5.vcf

     

    1937
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) | (FORMAT/DP<=20)' \
    --no-update $input/final_all.vcf --output-file $output/final_filtered_dan_bcf_v7.vcf


    156,082
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) |(GT~"\.") | (QUAL<=50)' \
    --no-update $input/final_all.vcf --output-file \
    $output/final_filtered_dan_bcf_v8.vcf


    1,896
    bcftools view --threads 11 --exclude \
    '(GT="alt" & FILTER!="PASS") | (GT="alt" & FORMAT/GQ<=0) | (GT="ref" & FORMAT/RGQ<=15) | (FORMAT/DP<=20) | (GT~"\.")' \
    --no-update $input/final_all.vcf --output-file \
    $output/final_filtered_dan_bcf_v9.vcf

     

    I think the point is that I filter across all samples, and if one of the samples does not fulfill the criterion, the entry will be filtered completely , therefore we are losing many of our variants. Do you have any opinion about this data?

    The point is that this dateset has been also analysed using older version of GATK and we had more than 250,000 variant called out of it.

    Your suggestion is very much appreciated.

     

    Best Regards

    Leila

     

     

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi Leila farajzadeh

    I guess you are right. It would be helpful to read this page http://samtools.github.io/bcftools/howtos/filtering.html about the usage of &, &&, | and || boolean operators which exactly address your multisample problem. In order for it to work, it would be better to use --include criterion with a corresponding expression. Anyway, the criteria I provided in the example BCFtools command are not for exact usage, it is just an example of the syntax. You have to derive the criteria yourself based on the task. If you just want to filter HOM REF calls, it surely does not make sense if any of the samples has a non-HOM-REF genotype foe a given site. In case all samples have a HOM REF genotype for a given site, it could be possible to filter them via

    bcftools view -i 'GT[*]="alt"' input.vcf.gz

    Again, explanation of  [*] is here http://samtools.github.io/bcftools/bcftools.html#expressions. You can also filter by calculated on the fly variables like total ALT allele counts in a batch of samples, etc. Examples are also accessible via the link above.

    0
    Comment actions Permalink
  • Avatar
    Leila farajzadeh

    Hi dear danilovkiri

    I am so pleased with your guidance. It was very nice and informative. This project is aiming to  find the rare variants for Schizophrenia from an isolated population,and this is the variant calling step before any statistical test. If you have some advice on the criterion for filtering in order to select the most informative list of variant  (not to loose so many variants and not to have many false positives)you are very much appreciated.

     

    Best Regards

    Leila

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi Leila farajzadeh

    I think you definitely have to apply VQSR to your VCF. The documentation for VQSR can be found in Tool Index, it is also necessary to read this doc. VQSR will mark the "good' variants with PASS in the FILTER VCF column. You'll have to set the tranches for VQSR based on what your expectation vs reality. You may then select all PASS variants and use BCFtools to get variants with AF exceeding/not exceeding a threshold specifying `--include 'INFO/AF<=0.1'` or somewhat similar. Basically, VQSR should be enough for general filtering. AF filtering can be done in multiple ways, one of them is using BCFtools.

     

    Good luck!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk