GATK4 validateSam Errors interpretation
b) Exact command used:
$gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" \
ValidateSamFile \
-I ${SM}_analysisReady.bam \
-O ${SM}_analysis_ready_validation_report \
-R $REF \
--IS_BISULFITE_SEQUENCED false \
--VERBOSITY ERROR
c) Entire error log:
-
Hi mehar,
We have a tutorial on interpreting these issues. Have you seen this resource?
You can also read more about BAM/SAM format and whether or not this is an error message you can ignore: https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats
Best,
Genevieve
-
Hi mehar,
It is best to fix all warnings and error messages, although warnings can sometimes be ignored. Most errors, though, will cause issues downstream.
Could you isolate where this issue is introduced? You can run ValidateSamFile before and after each of your steps to determine where something went wrong.
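As a rough sketch of that before-and-after check, the function below runs ValidateSamFile in SUMMARY mode on each intermediate BAM in pipeline order. The `GATK_CMD` variable, the function name and the `.validation_summary.txt` suffix are placeholders for illustration, not anything from your pipeline.

```shell
#!/bin/sh
# Sketch only: GATK_CMD, validate_steps and the output suffix are
# illustrative names, not part of GATK.
GATK_CMD=${GATK_CMD:-gatk}

# usage: validate_steps <reference.fasta> <step1.bam> [<step2.bam> ...]
validate_steps() {
    ref=$1
    shift
    for bam in "$@"; do
        echo "== ${bam}"
        # SUMMARY mode writes a count per error/warning type instead of
        # one line per offending record.
        "$GATK_CMD" ValidateSamFile \
            -I "$bam" \
            -R "$ref" \
            -O "${bam%.bam}.validation_summary.txt" \
            --MODE SUMMARY \
            || echo "ValidateSamFile flagged ${bam}" >&2
    done
}
```

Calling it with the BAMs listed in the order they were produced, for example `validate_steps genomic.fasta merged.bam dedup.bam sorted.bam`, should show the first file whose summary contains a given error type.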
Genevieve
-
Hi Brandt,
Here is some traceback info starting from the first step:
The sample has fastq files from multiple lanes. Paired-end reads from each lane are aligned independently with bwa:
bwa mem -t 16 -R "@RG\tID:${sample}\tLB:${sample}\tSM:${sample}\tPL:ILLUMINA\tPU:NG" genomic.fasta ${prefix}_1.fastq.gz ${prefix}_2.fastq.gz | samtools view -1 -@16 - > ./${prefix}_cf4_aligned_reads.bam
Then all the bams from different lanes, listed in `bams.list` are merged:
params=$(cat bams.list | while read bam; do printf "I=$bam "; done)
$gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" MergeSamFiles $params USE_THREADING=true MAX_RECORDS_IN_RAM=1000000 CREATE_INDEX=true SO=coordinate O=${sample}_merged_files.bam
Now ValidateSam is run on the merged bam file, which gives the error below (only one line is shown for brevity; the others are similar INVALID_TAG_NM errors):
ERROR::INVALID_TAG_NM:Record 35004695, Read name HISEQ:11:D1Y4UACXX:3:2307:4899:47176, NM tag (nucleotide differences) in file [0] does not match reality [3]
After merging, GATK MarkDuplicates is run, and ValidateSam is again run on the duplicates marked bam file which gives the same error:
ERROR::INVALID_TAG_NM:Record 35004695, Read name HISEQ:11:D1Y4UACXX:3:2307:4899:47176, NM tag (nucleotide differences) in file [0] does not match reality [3]
After this step, without checking for errors, the remaining steps (SortSam, BaseRecalibrator, LeftAlignIndels, FixMateInformation and SetNmMdAndUqTags) were run sequentially, and ValidateSam was run again on the bam from the final step, SetNmMdAndUqTags.
This gave the error that was posted in the original post. To add more complexity, two different samples gave two different errors, as shown in the original post. These errors differ from the ones in the MergeSamFiles and MarkDuplicates bam files shown above.
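To compare which error types appear at each stage, a small tally over the ValidateSam report can help. This is a minimal sketch that assumes the report lines look like the `ERROR::INVALID_TAG_NM:Record ...` excerpts above; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: counts report lines per severity and error type.
# Assumes lines of the form "ERROR::TYPE:Record N, ..." or "WARNING::TYPE:...".
# usage: tally_validation_report <report.txt>
tally_validation_report() {
    awk -F':' '
        /^(ERROR|WARNING)::/ { n[$1 "::" $3]++ }
        END { for (k in n) print n[k] "\t" k }
    ' "$1" | sort -k1,1nr -k2,2
}
```

Running it on the report from each step gives one line per error type with its count, which makes it easier to see when a type first appears or disappears.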
Can you spot any odd behaviour from this info, or a step where it could be fixed? Thanks.
-
Hi mehar,
I believe FixMateInformation should have fixed your original MISMATCH_MATE_CIGAR_STRING error. Could you please show the stack trace from that command as well as the ValidateSamFile stack trace from the output file?
Genevieve
-
So here is more traceback info. Here is the workflow used to trace the issue:
Bwa mem -> MergeSam-> FixMateInformation -> SetNmMdAndUqTags -> MarkDuplicates -> SortSam -> FixMateInformation -> SetNmMdAndUqTags -> BaseRecalibrator&ApplyBQSR -> LeftAlignIndels -> FixMateInformation -> SetNmMdAndUqTags -> HaplotypeCaller
After MergeSam using Picard, I get the error below in the ValidateSam report (many similar lines; only one is shown):
ERROR::INVALID_TAG_NM:Record 24999726, Read name ST-E00251:586:HMYG3CCXY:8:2103:11302:71594, NM tag (nucleotide differences) in file [0] does not match reality [2]
Then FixMateInformation -> SetNmMdAndUqTags is run and this error is removed.
Then MarkDuplicates -> SortSam is run. Since it is recommended here (https://gatk.broadinstitute.org/hc/en-us/articles/360040096212), FixMateInformation & SetNmMdAndUqTags are run again (2nd time) before BQSR, and the `validateSam` report shows no errors at this stage.
Then BaseRecalibrator&ApplyBQSR is run, with no errors in the validateSam report.
Then, after LeftAlignIndels, validateSam reports the error below (only one line is shown, but there are many with the same message):
ERROR::MISMATCH_MATE_CIGAR_STRING:Record 932539, Read name ST-E00251:586:HMYG3CCXY:4:1205:14763:3717, Mate CIGAR string does not match CIGAR string of mate
Then FixMateInformation -> SetNmMdAndUqTags is run for the 3rd time followed by validateSam check. The error still remains.
So it appears that the error is from `LeftAlignIndels` step. Despite the errors, `HaplotypeCaller` succeeded.
1) How can the error be removed, what are the drawbacks of ignoring it, and which downstream analyses would be affected? HaplotypeCaller works well, but what about SV detection methods?
2) In my experience, and from discussions with colleagues and collaborators, it isn't standard practice to run `validateSam` and check for errors or warnings; as long as the bam goes through all steps, we don't bother to trace back. Also, these bams work well with external programs.
With this experience I am confident that most bams will show some sort of error or warning if users run `validateSam` on them. What would you suggest to the community as best practice going forward?
-
Hi mehar,
This error from ValidateSamFile is most likely coming up after LeftAlignIndels because LeftAlignIndels will standardize an indel's location by moving it to the left-most coordinate that is consistent with data. So, the cigar will change if the indel is moved. It does not update the mate, which leads to the mate cigars not matching.
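To see the mismatch directly, one option is to compare each read's MC tag with the CIGAR of its mate. The sketch below reads SAM text on stdin (for example from `samtools view` on a small region) and prints primary pairs where the two disagree. It is illustrative only: the function name is made up, it holds every read in memory, and it assumes both mates are present with one primary record each.

```shell
#!/bin/sh
# Sketch only: report primary read pairs whose MC tag differs from the
# CIGAR of the mate. Input: SAM text on stdin; header lines are skipped.
find_mate_cigar_mismatches() {
    awk -F'\t' '
        /^@/ { next }
        {
            flag = $2
            # skip secondary (0x100) and supplementary (0x800) records
            if (int(flag / 256) % 2 || int(flag / 2048) % 2) next
            # skip reads that are not paired (0x1 unset)
            if (flag % 2 == 0) next
            # 0x40 set means first in pair, otherwise treat as second
            r = (int(flag / 64) % 2) ? 1 : 2
            cigar[$1, r] = $6
            for (i = 12; i <= NF; i++)
                if ($i ~ /^MC:Z:/) mc[$1, r] = substr($i, 6)
            names[$1] = 1
        }
        END {
            for (q in names)
                for (r = 1; r <= 2; r++) {
                    m = 3 - r
                    if ((q, r) in mc && (q, m) in cigar && mc[q, r] != cigar[q, m])
                        print q "\tread" r "\tMC=" mc[q, r] "\tmate_CIGAR=" cigar[q, m]
                }
        }
    '
}
```

A pair where read 1 carries `MC:Z:10M` while read 2 now has CIGAR `5M1D5M` would be printed, which is the situation described above when an indel is shifted in one mate only.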
- It is confusing that this error would still exist after FixMateInformation because FixMateInformation should fix this. Could you post your command for FixMateInformation at that step, and the stack trace? We want to make sure --ADD_MATE_CIGAR is set to TRUE and that there are no other issues.
- ValidateSam is really only necessary when issues come up in other steps of the pipeline and you need to determine what is causing them. Even though these bams are working now, this is an ERROR, so you will probably want to fix it to avoid future issues.
Genevieve
-
Here it is. --ADD_MATE_CIGAR is set to TRUE, and the error still occurs:
Using GATK jar /gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms10G -Xmx32G -Djava.io.tmpdir=./tmp -jar /gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar FixMateInformation -R genomic.fasta -I recal_leftaligned_reads.bam -O fixed_recal_leftaligned_reads.bam --ADD_MATE_CIGAR true --ASSUME_SORTED true --SORT_ORDER coordinate --VALIDATION_STRINGENCY SILENT --MAX_RECORDS_IN_RAM 500000 --IGNORE_MISSING_MATES true --VERBOSITY ERROR
02:12:11.278 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file://gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Feb 10 02:12:11 EET 2021] FixMateInformation --INPUT recal_leftaligned_reads.bam --OUTPUT fixed_recal_leftaligned_reads.bam --SORT_ORDER coordinate --ASSUME_SORTED true --ADD_MATE_CIGAR true --IGNORE_MISSING_MATES true --VERBOSITY ERROR --VALIDATION_STRINGENCY SILENT --MAX_RECORDS_IN_RAM 500000 --REFERENCE_SEQUENCE genomic.fasta --QUIET false --COMPRESSION_LEVEL 2 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Feb 10, 2021 2:12:11 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Wed Feb 10 02:12:11 EET 2021] Executing as user@r18c41.bullx on Linux 3.10.0-1062.33.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_241-b07; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.9.0
[Wed Feb 10 02:55:05 EET 2021] picard.sam.FixMateInformation done. Elapsed time: 42.90 minutes.
-
Hi mehar, thank you for that information.
We are thinking that this issue is coming from secondary alignments, because FixMateInformation does not operate on secondary alignments.
To confirm that this is the case, could you run PrintReads --read-filter NotSecondaryAlignmentReadFilter on your output from FixMateInformation? If the resulting BAM passes ValidateSamFile, then it looks like the secondary alignments are the issue.
If you do not need or want secondary alignments, you can run LeftAlignIndels with the argument --read-filter NotSecondaryAlignmentReadFilter. If you want secondary alignments and find they are the issue, then the error can be ignored with --IGNORE MISMATCH_MATE_CIGAR_STRING when running ValidateSamFile.
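Put together, the check could look roughly like the sketch below. The function names, the `GATK_CMD` variable and the `.no_secondary.bam` suffix are placeholders for illustration; the tool names and the read filter are the ones mentioned above, and `--IGNORE` is given the validation type as its value.

```shell
#!/bin/sh
# Sketch only: GATK_CMD, the function names and the output suffix are
# illustrative, not part of GATK.
GATK_CMD=${GATK_CMD:-gatk}

# Drop secondary alignments, then validate what is left.
# usage: check_without_secondary <in.bam> <reference.fasta>
check_without_secondary() {
    in_bam=$1
    ref=$2
    primary_bam="${in_bam%.bam}.no_secondary.bam"
    "$GATK_CMD" PrintReads -I "$in_bam" -O "$primary_bam" \
        --read-filter NotSecondaryAlignmentReadFilter || return 1
    "$GATK_CMD" ValidateSamFile -I "$primary_bam" -R "$ref" --MODE SUMMARY
}

# Validate the full BAM while ignoring the mate CIGAR mismatch type.
# usage: validate_ignoring_mate_cigar <in.bam> <reference.fasta>
validate_ignoring_mate_cigar() {
    "$GATK_CMD" ValidateSamFile -I "$1" -R "$2" --MODE SUMMARY \
        --IGNORE MISMATCH_MATE_CIGAR_STRING
}
```

If the first function reports a clean BAM while the unfiltered file still fails, that would point to the secondary alignments; the second function is only appropriate once you have decided the mismatch is safe to ignore.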
Let me know what you find.
Best,
Genevieve
-
How about the other error in the original question:
`ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE`
Is it also from LeftAlignIndels?
-
I don't see that error coming up in the in-depth analysis in your more recent comment. Is it still an issue?
-
It is from a different sample.
-
It could be from LeftAlignIndels. You can try the same diagnosis I recommended above to determine whether the secondary alignments caused that issue as well.
If not, please look into each step more closely to see where that error was introduced.