Which exact info does BaseRecalibrator uses from a vcf?
I'm wondering if I can use a vcf generated from ANGSD to recalibrate bases using BaseRecalibrator.
Which info the program actually extract from the vcf to run the recalibration?
ANGSD vcf output has
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
NC_045955.1 576 . C T 29 PASS NS=21;DP=78;AF=0.112661 GT:DP:GL:PL:GP
where
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=RPB2,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias [CDF] (bigger is better)">
##INFO=<ID=MQB2,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias [CDF] (bigger is better)">
##INFO=<ID=BQB2,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias [CDF] (bigger is better)">
##INFO=<ID=MQSB2,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias [CDF] (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##INFO=<ID=AF,Number=A,Type=Float,Description="Minor Allele Frequency">
##INFO=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-fwd, ref-reverse, alt-fwd and alt-reverse bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of high-quality non-reference bases">
##FORMAT=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (loglikeratios to the most likely (in log10))">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
(REQUIRED) Please provide:
a) GATK version used:
b) Exact command used:
c) Entire error log:
Choose a category for your question:
a)How do I (......)?
b) What does (......) mean?
c) Why do I see (......)?
d) Where do I find (......)?
e) Will (......) be in future releases?
-
Hi madzayasodara, you can use ValidateVariants to validate a VCF file for use with GATK. The VCF input to BaseRecalibrator is intended to be known sites: https://gatk.broadinstitute.org/hc/en-us/articles/360047217531-BaseRecalibrator
-
Hi, madzayasodara, there may be an error with the file you used with HaplotypeCaller, could you check your BAM file with ValidateSamFile? https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile
Also, the error message from ValidateVariants shows you exactly where the problem is located, it may be helpful to look closer at this location to find what is happening:
"one or more of the ALT allele(s) for the record at position NC_044999.1:474291 are not observed at all in the sample genotypes"
-
Thank you :)
This is weird my vcf check fails even when I call variants using HaplotypeCaller
This is the cmd I use to call
gatk HaplotypeCaller -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I realignmatched/RSFV1A_match.bam
...
-I realignmatched/RSFV1Z_match.bam -L NC_044999.1 --stand-call-conf 30 -mbq 20 -O raw_variantsZF_NC_044999.1.vcf
this is the cmd I use to validate
gatk ValidateVariants -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -V raw_variantsZF_NC_044999.1.vcf
This is the error I get
Using GATK jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar ValidateVariants -R /users/mfariasv/data/mfariasv/newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -V /users/mfariasv/data/mfariasv/aligned_newZFV2/rawvar2/raw_variantsZF_NC_044999.1.vcf
21:15:20.517 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Aug 16, 2020 9:15:20 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
21:15:20.727 INFO ValidateVariants - ------------------------------------------------------------
21:15:20.728 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.6.0
21:15:20.728 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
21:15:20.728 INFO ValidateVariants - Executing as mfariasv@login003 on Linux v3.10.0-957.5.1.el7.x86_64 amd64
21:15:20.728 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
21:15:20.728 INFO ValidateVariants - Start Date/Time: August 16, 2020 9:15:20 PM EDT
21:15:20.728 INFO ValidateVariants - ------------------------------------------------------------
21:15:20.728 INFO ValidateVariants - ------------------------------------------------------------
21:15:20.729 INFO ValidateVariants - HTSJDK Version: 2.21.2
21:15:20.729 INFO ValidateVariants - Picard Version: 2.21.9
21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:15:20.729 INFO ValidateVariants - Deflater: IntelDeflater
21:15:20.729 INFO ValidateVariants - Inflater: IntelInflater
21:15:20.729 INFO ValidateVariants - GCS max retries/reopens: 20
21:15:20.730 INFO ValidateVariants - Requester pays: disabled
21:15:20.730 INFO ValidateVariants - Initializing engine
21:15:21.163 INFO FeatureManager - Using codec VCFCodec to read file file:///users/mfariasv/data/mfariasv/aligned_newZFV2/rawvar2/raw_variantsZF_NC_044999.1.vcf
21:15:21.287 INFO ValidateVariants - Done initializing engine
21:15:21.287 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
21:15:21.287 WARN ValidateVariants - Other possible validations will still be performed
21:15:21.287 INFO ProgressMeter - Starting traversal
21:15:21.288 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
21:15:22.542 INFO ValidateVariants - Shutting down engine
[August 16, 2020 9:15:22 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1911554048
***********************************************************************A USER ERROR has occurred: Input /users/mfariasv/data/mfariasv/aligned_newZFV2/rawvar2/raw_variantsZF_NC_044999.1.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position NC_044999.1:474291 are not observed at all in the sample genotypes
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace. -
Hi Genevieve Brandt (she/her) I checked the bam files I had generated by aligning my reads using
bwa mem -R '@RG\tID:firstrun\tSM:RSFV1A\tPL:illumina\tLB:lib1\tPU:005' -p newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa RSFV1A_L005_clean_u.fastq | samtools sort -o RSFV1A_005.bam -
and error came back
gatk ValidateSamFile -I RSFV1A_005.bam -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa
Using GATK jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar ValidateSamFile -I /users/mfariasv/data/mfariasv/aligned_newZFV2/firstrun/RSFV1A_005.bam -R /users/mfariasv/data/mfariasv/newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa
19:31:19.836 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Aug 19 19:31:19 EDT 2020] ValidateSamFile --INPUT /users/mfariasv/data/mfariasv/aligned_newZFV2/firstrun/RSFV1A_005.bam --REFERENCE_SEQUENCE /users/mfariasv/data/mfariasv/newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Aug 19, 2020 7:31:20 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Wed Aug 19 19:31:20 EDT 2020] Executing as mfariasv@login004 on Linux 3.10.0-957.5.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.6.0
ERROR::INVALID_TAG_NM:Record 26635, Read name K00188:204:HFM2HBBXX:5:1214:5852:42513, NM tag (nucleotide differences) in file [12] does not match reality [13]
ERROR::INVALID_TAG_NM:Record 112162, Read name K00188:204:HFM2HBBXX:5:2103:5152:37044, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 119710, Read name K00188:204:HFM2HBBXX:5:1201:31923:46504, NM tag (nucleotide differences) in file [13] does not match reality [14]
...Maximum output of [100] errors reached.
[Wed Aug 19 19:32:29 EDT 2020] picard.sam.ValidateSamFile done. Elapsed time: 1.16 minutes.
Runtime.totalMemory()=5220335616
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned:
3I don't know how bwa created these errors.
Should I filter these reads?
Thank you!
-
Hi madzayasodara in this ValidateSamFile tutorial, we go through what these error messages mean. These errors will need to be addressed before you proceed with GATK.
"ERROR::INVALID_TAG_NM:Record 26635, Read name K00188:204:HFM2HBBXX:5:1214:5852:42513, NM tag (nucleotide differences) in file [12] does not match reality [13]"
It is an error with the NM tag which comes from the alignment. I would recommend doing this analysis with your BAM file, using ValidateSamFile to see if the errors existed there.
One thing to confirm would be that you are properly marking read groups, as well as using the same reference version during all your steps.
-
Hello Genevieve Brandt (she/her),
gotcha. Just few things I'm confused.
There's no bam/sam file before this one I tested.
Before that, I had fastq files I aligned with BWA.
I also noticed this in the tutorial you suggested checking out:
"We also see a
WARNING
about missingNM
tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by addingIGNORE=MISSING_TAG_NM
from now on when you run ValidateSamFile on this file."So, I wonder if it's ok to do the same for INVALID_TAG_NM?
When I do
IGNORE=IGNORE_TAG_NM,
the bam file passes validation.Do you think that yes, the error I get when I do ValidateVariants on the vcf I called from these bam files still might relate to the NM tag error in the bam files?
Yes, I am checking if I did the red group correct and making sure I used same ref genome in all steps.
Thanks!
-
Hi madzayasodara, Yes, my mistake about running ValidateSamFile again, I think I was confused at which step you were looking at. It will most likely be fine to ignore the NM tag errors.
Please sign in to leave a comment.
7 comments