Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Which exact info does BaseRecalibrator uses from a vcf?

0

7 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi madzayasodara, you can use ValidateVariants to validate a VCF file for use with GATK. The VCF input to BaseRecalibrator is intended to be known sites: https://gatk.broadinstitute.org/hc/en-us/articles/360047217531-BaseRecalibrator

    1
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi, madzayasodara, there may be an error with the file you used with HaplotypeCaller, could you check your BAM file with ValidateSamFile? https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile

    Also, the error message from ValidateVariants shows you exactly where the problem is located, it may be helpful to look closer at this location to find what is happening:

    "one or more of the ALT allele(s) for the record at position NC_044999.1:474291 are not observed at all in the sample genotypes"

    1
    Comment actions Permalink
  • Avatar
    madzayasodara

    Thank you :)

    This is weird my vcf check fails even when I call variants using HaplotypeCaller

    This is the cmd I use to call 

    gatk HaplotypeCaller -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I realignmatched/RSFV1A_match.bam 

    ...

    -I realignmatched/RSFV1Z_match.bam -L NC_044999.1 --stand-call-conf 30 -mbq 20 -O raw_variantsZF_NC_044999.1.vcf

    this is the cmd I use to validate

    gatk ValidateVariants -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -V raw_variantsZF_NC_044999.1.vcf

    This is the error I get

    Using GATK jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar ValidateVariants -R /users/mfariasv/data/mfariasv/newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -V /users/mfariasv/data/mfariasv/aligned_newZFV2/rawvar2/raw_variantsZF_NC_044999.1.vcf
    21:15:20.517 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Aug 16, 2020 9:15:20 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    21:15:20.727 INFO ValidateVariants - ------------------------------------------------------------
    21:15:20.728 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.6.0
    21:15:20.728 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    21:15:20.728 INFO ValidateVariants - Executing as mfariasv@login003 on Linux v3.10.0-957.5.1.el7.x86_64 amd64
    21:15:20.728 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
    21:15:20.728 INFO ValidateVariants - Start Date/Time: August 16, 2020 9:15:20 PM EDT
    21:15:20.728 INFO ValidateVariants - ------------------------------------------------------------
    21:15:20.728 INFO ValidateVariants - ------------------------------------------------------------
    21:15:20.729 INFO ValidateVariants - HTSJDK Version: 2.21.2
    21:15:20.729 INFO ValidateVariants - Picard Version: 2.21.9
    21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    21:15:20.729 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    21:15:20.729 INFO ValidateVariants - Deflater: IntelDeflater
    21:15:20.729 INFO ValidateVariants - Inflater: IntelInflater
    21:15:20.729 INFO ValidateVariants - GCS max retries/reopens: 20
    21:15:20.730 INFO ValidateVariants - Requester pays: disabled
    21:15:20.730 INFO ValidateVariants - Initializing engine
    21:15:21.163 INFO FeatureManager - Using codec VCFCodec to read file file:///users/mfariasv/data/mfariasv/aligned_newZFV2/rawvar2/raw_variantsZF_NC_044999.1.vcf
    21:15:21.287 INFO ValidateVariants - Done initializing engine
    21:15:21.287 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
    21:15:21.287 WARN ValidateVariants - Other possible validations will still be performed
    21:15:21.287 INFO ProgressMeter - Starting traversal
    21:15:21.288 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    21:15:22.542 INFO ValidateVariants - Shutting down engine
    [August 16, 2020 9:15:22 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.03 minutes.
    Runtime.totalMemory()=1911554048
    ***********************************************************************

    A USER ERROR has occurred: Input /users/mfariasv/data/mfariasv/aligned_newZFV2/rawvar2/raw_variantsZF_NC_044999.1.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position NC_044999.1:474291 are not observed at all in the sample genotypes

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    0
    Comment actions Permalink
  • Avatar
    madzayasodara

    Hi Genevieve Brandt (she/her) I checked the bam files I had generated by aligning my reads using 

    bwa mem -R '@RG\tID:firstrun\tSM:RSFV1A\tPL:illumina\tLB:lib1\tPU:005' -p newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa RSFV1A_L005_clean_u.fastq | samtools sort -o RSFV1A_005.bam -

    and error came back 

    gatk ValidateSamFile -I RSFV1A_005.bam -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa


    Using GATK jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar ValidateSamFile -I /users/mfariasv/data/mfariasv/aligned_newZFV2/firstrun/RSFV1A_005.bam -R /users/mfariasv/data/mfariasv/newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa
    19:31:19.836 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/rt/7.2/opt/gatk/4.1.6.0/bin/gatk-package-4.1.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Wed Aug 19 19:31:19 EDT 2020] ValidateSamFile --INPUT /users/mfariasv/data/mfariasv/aligned_newZFV2/firstrun/RSFV1A_005.bam --REFERENCE_SEQUENCE /users/mfariasv/data/mfariasv/newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    Aug 19, 2020 7:31:20 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    [Wed Aug 19 19:31:20 EDT 2020] Executing as mfariasv@login004 on Linux 3.10.0-957.5.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.6.0
    ERROR::INVALID_TAG_NM:Record 26635, Read name K00188:204:HFM2HBBXX:5:1214:5852:42513, NM tag (nucleotide differences) in file [12] does not match reality [13]
    ERROR::INVALID_TAG_NM:Record 112162, Read name K00188:204:HFM2HBBXX:5:2103:5152:37044, NM tag (nucleotide differences) in file [2] does not match reality [3]
    ERROR::INVALID_TAG_NM:Record 119710, Read name K00188:204:HFM2HBBXX:5:1201:31923:46504, NM tag (nucleotide differences) in file [13] does not match reality [14]
    ...

    Maximum output of [100] errors reached.
    [Wed Aug 19 19:32:29 EDT 2020] picard.sam.ValidateSamFile done. Elapsed time: 1.16 minutes.
    Runtime.totalMemory()=5220335616
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Tool returned:
    3

    I don't know how bwa created these errors.

    Should I filter these reads?

    Thank you!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi madzayasodara in this ValidateSamFile tutorial, we go through what these error messages mean. These errors will need to be addressed before you proceed with GATK. 

    "ERROR::INVALID_TAG_NM:Record 26635, Read name K00188:204:HFM2HBBXX:5:1214:5852:42513, NM tag (nucleotide differences) in file [12] does not match reality [13]"

    It is an error with the NM tag which comes from the alignment. I would recommend doing this analysis with your BAM file, using ValidateSamFile to see if the errors existed there.

    One thing to confirm would be that you are properly marking read groups, as well as using the same reference version during all your steps.

     

    0
    Comment actions Permalink
  • Avatar
    madzayasodara

    Hello Genevieve Brandt (she/her),

    gotcha. Just few things I'm confused.

    There's no bam/sam file before this one I tested. 

    Before that, I had fastq files I aligned with BWA.

    I also noticed this in the tutorial you suggested checking out:

    "We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file."

    So, I wonder if it's ok to do the same for INVALID_TAG_NM?

    When I do IGNORE=IGNORE_TAG_NM, the bam file passes validation.

    Do you think that yes, the error I get when I do ValidateVariants on the vcf I called from these bam files still might relate to the NM tag error in the bam files?

    Yes, I am checking if I did the red group correct and making sure I used same ref genome in all steps. 

    Thanks!

     

     

     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi madzayasodara, Yes, my mistake about running ValidateSamFile again, I think I was confused at which step you were looking at. It will most likely be fine to ignore the NM tag errors.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk