Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GATK4 validateSam Errors interpretation

12 comments

  • Genevieve Brandt (she/her)

    Hi mehar,

    We have a tutorial on interpreting these issues. Have you seen this resource?

    https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile

    You can also read more about BAM/SAM format and whether or not this is an error message you can ignore: https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats

    Best,

    Genevieve

  • Genevieve Brandt (she/her)

    Hi mehar,

    It is best to fix all warnings and errors, although warnings can sometimes be ignored. Most errors, though, will cause issues downstream.

    Could you isolate where this issue is introduced? You can run ValidateSamFile before and after each of your steps to determine where something went wrong.

    Genevieve
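
When comparing reports taken before and after each step, a per-type tally makes it easier to see what changed. The following is a minimal sketch, assuming the report uses verbose-mode lines shaped like `ERROR::INVALID_TAG_NM:Record ...` as quoted later in this thread. `summarize_validation` is a hypothetical helper name, not a GATK command.

```shell
# Count ValidateSamFile findings per severity and type.
# Assumption: each finding is one line starting with "ERROR::<TYPE>:" or
# "WARNING::<TYPE>:"; any other line is ignored.
summarize_validation() {
  awk -F':' '
    /^(ERROR|WARNING)::/ { count[$1 "::" $3]++ }   # $1 = severity, $3 = type
    END { for (k in count) printf "%s\t%d\n", k, count[k] }
  ' "$@" | LC_ALL=C sort
}
```

It reads report files given as arguments, or standard input if none are given, for example `summarize_validation merged.validation.txt`. ValidateSamFile's own `--MODE SUMMARY` output gives a similar tally and is probably the better choice when the tool can simply be rerun.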

  • mehar

    Genevieve Brandt (she/her)

    Hi Brandt,

    Here is some traceback info starting from the first step:

    The sample has fastq files from multiple lanes. Paired-end reads from each lane are aligned independently with bwa:

    bwa mem -t 16 -R "@RG\tID:${sample}\tLB:${sample}\tSM:${sample}\tPL:ILLUMINA\tPU:NG" genomic.fasta ${prefix}_1.fastq.gz ${prefix}_2.fastq.gz | samtools view -1 -@16 - > ./${prefix}_cf4_aligned_reads.bam

    Then all the bams from different lanes, listed in `bams.list` are merged:

    params=$(cat bams.list | while read bam; do printf "I=$bam "; done)

    $gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" MergeSamFiles $params USE_THREADING=true MAX_RECORDS_IN_RAM=1000000 CREATE_INDEX=true SO=coordinate O=${sample}_merged_files.bam

    ValidateSam is now run on the merged BAM file and reports the error below (only one line is shown for brevity; the others are similar INVALID_TAG_NM errors):

    ERROR::INVALID_TAG_NM:Record 35004695, Read name HISEQ:11:D1Y4UACXX:3:2307:4899:47176, NM tag (nucleotide differences) in file [0] does not match reality [3]

    After merging, GATK MarkDuplicates is run, and ValidateSam is again run on the duplicates marked bam file which gives the same error:

    ERROR::INVALID_TAG_NM:Record 35004695, Read name HISEQ:11:D1Y4UACXX:3:2307:4899:47176, NM tag (nucleotide differences) in file [0] does not match reality [3]

    After this step, without checking the errors, the remaining steps (SortSam, BaseRecalibrator, LeftAlignIndels, FixMateInformation and SetNmMdAndUqTags) are run sequentially, and ValidateSam is run again on the BAM from the final step, SetNmMdAndUqTags.

    This gave the error posted in the original post. To add more complexity, two different samples gave two different errors, as shown in the original post. These errors differ from those in the MergeSamFiles and MarkDuplicates BAM files shown above.

    Can you spot any odd behaviour in this info, or a step where it could be fixed? Thanks.
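
The merge step above builds its `I=` arguments by unquoted word splitting, which breaks if a BAM path contains spaces. The following is a minimal sketch of the same step that passes one quoted argument per lane. It keeps the Picard-style `KEY=value` arguments used in the post. `merge_lanes` is a hypothetical helper name, and `GATK` is an assumed variable that defaults to `gatk` (the post uses its own `$gatk4`).

```shell
# Merge every BAM listed in a file (one path per line) into a single BAM.
# Usage: merge_lanes bams.list merged.bam
merge_lanes() {
  list=$1
  merged=$2
  set --                        # reuse positional parameters as the argument list
  while IFS= read -r bam || [ -n "$bam" ]; do
    [ -n "$bam" ] || continue   # skip blank lines
    set -- "$@" "I=$bam"        # one quoted I=<path> argument per lane BAM
  done < "$list"
  "${GATK:-gatk}" MergeSamFiles "$@" SO=coordinate CREATE_INDEX=true O="$merged"
}
```

The memory and threading options from the original command are left out for brevity and can be added back on the last line.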

  • Genevieve Brandt (she/her)

    Hi mehar,

    I believe FixMateInformation should have fixed your original MISMATCH_MATE_CIGAR_STRING error. Could you please show the stack trace from that command as well as the ValidateSamFile stack trace from the output file?

    Genevieve

  • mehar

    So here is more traceback info. Here is the workflow used to trace the issue:

    Bwa mem -> MergeSam -> FixMateInformation -> SetNmMdAndUqTags -> MarkDuplicates -> SortSam -> FixMateInformation -> SetNmMdAndUqTags -> BaseRecalibrator & ApplyBQSR -> LeftAlignIndels -> FixMateInformation -> SetNmMdAndUqTags -> HaplotypeCaller

    After MergeSam using Picard, I have the error below in the ValidateSam report (there are many similar ones; only one is shown):

    ERROR::INVALID_TAG_NM:Record 24999726, Read name ST-E00251:586:HMYG3CCXY:8:2103:11302:71594, NM tag (nucleotide differences) in file [0] does not match reality [2]

    Then FixMateInformation -> SetNmMdAndUqTags is run and this error is removed.

    Then MarkDuplicates -> SortSam is run. Since it is recommended here (https://gatk.broadinstitute.org/hc/en-us/articles/360040096212), FixMateInformation and SetNmMdAndUqTags are run again (2nd time) before BQSR, and there is no error in the `validateSam` report at this stage.

    Then BaseRecalibrator & ApplyBQSR are run, with no errors in the validateSam report.

    Then, after LeftAlignIndels, validateSam reports the error below (only one is shown, but there are many with the same message):

    ERROR::MISMATCH_MATE_CIGAR_STRING:Record 932539, Read name ST-E00251:586:HMYG3CCXY:4:1205:14763:3717, Mate CIGAR string does not match CIGAR string of mate

    Then FixMateInformation -> SetNmMdAndUqTags is run for the 3rd time followed by validateSam check. The error still remains.

    So it appears that the error is from `LeftAlignIndels` step. Despite the errors, `HaplotypeCaller` succeeded.

    1) How can the error be removed? What are the drawbacks of ignoring it, and which downstream analyses would be affected? HaplotypeCaller works well, but what about SV detection methods?

    2) In my experience, and from discussions with colleagues and collaborators, it isn't standard practice to run `validateSam` and check for errors or warnings. As long as the BAM goes through all the steps, we don't bother to trace back. Also, these BAMs work well with external programs.

    With this experience, I am confident that most BAMs will show some sort of error or warning if users run `validateSam` on them. What would you suggest to the community as best practice going forward?
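
The tracing described in this comment can be partly automated once a ValidateSam report has been saved after each step. The following is a minimal sketch, assuming one verbose report file per step and finding lines shaped like `ERROR::MISMATCH_MATE_CIGAR_STRING:Record ...` as quoted above. `first_report_with_error` and the report file names are hypothetical.

```shell
# Print the first report, in the order given, that contains the error type.
# Usage: first_report_with_error ERROR_TYPE report1 [report2 ...]
# Returns 1 if no report contains it.
first_report_with_error() {
  error_type=$1
  shift
  for report in "$@"; do
    if grep -q "^ERROR::${error_type}:" "$report"; then
      printf '%s\n' "$report"
      return 0
    fi
  done
  return 1
}
```

For example, `first_report_with_error MISMATCH_MATE_CIGAR_STRING 1_bqsr.txt 2_leftalign.txt 3_fixmate.txt` would print `2_leftalign.txt` for the workflow described here, provided the reports are passed in pipeline order.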

  • Genevieve Brandt (she/her)

    Hi mehar,

    This error from ValidateSamFile most likely comes up after LeftAlignIndels because LeftAlignIndels standardizes an indel's location by moving it to the left-most coordinate that is consistent with the data. So the CIGAR changes if the indel is moved. The tool does not update the mate's record, which leads to the mate CIGARs not matching.

    1. It is confusing that this error still exists after FixMateInformation, because FixMateInformation should fix it. Could you post your command for FixMateInformation at that step, and the stack trace? We want to make sure --ADD_MATE_CIGAR is set to TRUE and that there are no other issues.
    2. ValidateSamFile is really only necessary for troubleshooting: when issues come up in other steps of the pipeline, it helps determine what is causing them. Even though these BAMs are working now, because this is an ERROR, you will probably want to fix it so that you don't run into issues later.

    Genevieve
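
The mismatch described above can be illustrated directly on SAM text. In the SAM format, the optional `MC:Z` tag holds the CIGAR string of the mate, so a read whose mate was left-aligned can end up with an `MC` value that no longer equals the mate's CIGAR field. The following is a rough sketch of that comparison for primary, paired reads only. It is not how ValidateSamFile implements its check, and `mc_mismatches` is a hypothetical helper name.

```shell
# Read SAM text on stdin; report primary paired reads whose MC tag differs
# from the CIGAR (column 6) of their mate.
mc_mismatches() {
  awk -F'\t' '
    /^@/ { next }                                             # skip header lines
    {
      flag = $2
      if (int(flag / 256) % 2 || int(flag / 2048) % 2) next   # secondary or supplementary
      if (!(flag % 2)) next                                   # not paired
      which = (int(flag / 64) % 2) ? 1 : 2                    # first or second in pair
      key = $1 SUBSEP which
      cigar[key] = $6
      mc[key] = ""
      for (i = 12; i <= NF; i++)
        if ($i ~ /^MC:Z:/) mc[key] = substr($i, 6)
      names[$1] = 1
    }
    END {
      for (n in names)
        for (e = 1; e <= 2; e++) {
          me = n SUBSEP e
          mate = n SUBSEP (3 - e)
          if ((me in mc) && (mate in cigar) && mc[me] != "" && mc[me] != cigar[mate])
            printf "%s/%d\tMC=%s\tmate_CIGAR=%s\n", n, e, mc[me], cigar[mate]
        }
    }'
}
```

A possible use is `samtools view recal_leftaligned_reads.bam | mc_mismatches | head`. The sketch keeps every read pair in memory, so it is better suited to a small region than to a whole-genome BAM.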

  • mehar

    Here it is. --ADD_MATE_CIGAR is set to TRUE, and the error still occurs:

    Using GATK jar /gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar

    Running:

        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms10G -Xmx32G -Djava.io.tmpdir=./tmp -jar /gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar FixMateInformation -R genomic.fasta -I recal_leftaligned_reads.bam -O fixed_recal_leftaligned_reads.bam --ADD_MATE_CIGAR true --ASSUME_SORTED true --SORT_ORDER coordinate --VALIDATION_STRINGENCY SILENT --MAX_RECORDS_IN_RAM 500000 --IGNORE_MISSING_MATES true --VERBOSITY ERROR

    02:12:11.278 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file://gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

    [Wed Feb 10 02:12:11 EET 2021] FixMateInformation --INPUT recal_leftaligned_reads.bam --OUTPUT fixed_recal_leftaligned_reads.bam --SORT_ORDER coordinate --ASSUME_SORTED true --ADD_MATE_CIGAR true --IGNORE_MISSING_MATES true --VERBOSITY ERROR --VALIDATION_STRINGENCY SILENT --MAX_RECORDS_IN_RAM 500000 --REFERENCE_SEQUENCE genomic.fasta --QUIET false --COMPRESSION_LEVEL 2 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

    Feb 10, 2021 2:12:11 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine

    INFO: Failed to detect whether we are running on Google Compute Engine.

    [Wed Feb 10 02:12:11 EET 2021] Executing as user@r18c41.bullx on Linux 3.10.0-1062.33.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_241-b07; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.9.0

    [Wed Feb 10 02:55:05 EET 2021] picard.sam.FixMateInformation done. Elapsed time: 42.90 minutes.
  • Genevieve Brandt (she/her)

    Hi mehar, thank you for that information.

    We think this issue is coming from secondary alignments, because FixMateInformation does not operate on secondary alignments.

    To confirm that this is the case, could you run PrintReads --read-filter NotSecondaryAlignmentReadFilter on your output from FixMateInformation? If the resulting BAM passes ValidateSamFile, then the secondary alignments are most likely the issue.

    If you do not need or want secondary alignments, you can run LeftAlignIndels with the argument --read-filter NotSecondaryAlignmentReadFilter. If you want secondary alignments and find they are the issue, then the error can be ignored with --IGNORE MISMATCH_MATE_CIGAR_STRING when running ValidateSamFile.

    Let me know what you find.

    Best,

    Genevieve
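
The two checks suggested above could be wrapped as follows. This is a sketch only: the function names are hypothetical, `GATK` is an assumed variable that defaults to `gatk`, and it assumes `--IGNORE` takes the error type as its value. The tool names and the read filter come from the reply above.

```shell
# Write a copy of a BAM without secondary alignments.
# Usage: drop_secondary in.bam out.bam
drop_secondary() {
  "${GATK:-gatk}" PrintReads -I "$1" -O "$2" --read-filter NotSecondaryAlignmentReadFilter
}

# Validate a BAM while ignoring mate-CIGAR mismatches.
# Usage: validate_ignoring_mate_cigar in.bam
validate_ignoring_mate_cigar() {
  "${GATK:-gatk}" ValidateSamFile -I "$1" --IGNORE MISMATCH_MATE_CIGAR_STRING
}
```

If the filtered BAM from `drop_secondary` validates cleanly, that supports the secondary-alignment explanation. Setting `GATK=echo` first prints the commands instead of running them, which is a cheap way to review them before a long run.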

  • mehar

    How about the other error in the original question?

    `ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE`

    Is it also from LeftAlignIndels?

  • Genevieve Brandt (she/her)

    I don't see that error coming up in the in-depth analysis from your more recent comment. Is it still an issue?

  • mehar

    It is from a different sample.

  • Genevieve Brandt (she/her)

    It could be from LeftAlignIndels. You can try the same diagnosis I recommended above to determine whether the secondary alignments caused that issue as well.

    If not, please look into each step more closely to see where that error was introduced.
