FixMateInformation does not solve ERROR:MATE_NOT_FOUND
Dear,
My ultimate goal is to use GATK HaplotypeCaller to identify the number of mutations in a clonal non-model insect by using the paired-end reads of one offspring to the assembled genome of its mother.
I am however experiencing issues for validating the SAM with ValidateSamFile.
Here is how I prepared the data prior to GATK.
First, I repeat masked the assembled genome (RepeatMasker/4.1.1):
RepeatMasker $ASSEMBLY -pa ${SLURM_CPUS_PER_TASK} -u -poly -s -dir ${WORKDIR}/GENOME
I mapped the reads of the offspring using BWA-mem/0.7.10 with SAMtools/1.9:
bwa index -a bwtsw $REPEAT_MASKED_ASSEMBLY
bwa mem -t ${SLURM_CPUS_PER_TASK} $REPEAT_MASKED_ASSEMBLY $ILLUMINA_R1 $ILLUMINA_R2 > ${WORKDIR}/ALIGN/aln.sam
# Sorting the BAM (by coordinates) and extracting only mapped reads (F 4)
samtools view -F 4 -S -b ${WORKDIR}/ALIGN/aln.sam > ${WORKDIR}/ALIGN/aln.bam
samtools sort ${WORKDIR}/ALIGN/aln.bam -o ${WORKDIR}/ALIGN/aln_mapped.bam
samtools index ${WORKDIR}/ALIGN/aln_mapped.bam
Using GATK/4.1.9.0 in a .sif container, I ran these commands:
samtools faidx $REPEAT_MASKED_ASSEMBLY
gatk --java-options "-Xmx90G" CreateSequenceDictionary -R $REPEAT_MASKED_ASSEMBLY
Then I started validating the SAM:
gatk --java-options "-Xmx90G" ValidateSamFile -I $BAM -R $REPEAT_MASKED_ASSEMBLY
This command returned first that there were MISSING_READ_GROUP, which I solved by:
gatk --java-options "-Xmx90G" AddOrReplaceReadGroups -I $BAM -O ${WORKDIR}/ALIGN/fixed_group.bam -R $REPEAT_MASKED_ASSEMBLY --RGLB lib1 --RGPL illumina --RGPU unit1 --RGSM ASP1H
samtools index ${WORKDIR}/ALIGN/fixed_group.bam
Running again ValidateSamFile, by ignoring ERROR::INVALID_TAG_NM.......NM tag (nucleotide differences) in file [0] does not match reality [1].
This was suggested here: https://gatk.broadinstitute.org/hc/en-us/community/posts/360072305352-Which-exact-info-does-BaseRecalibrator-uses-from-a-vcf-
gatk --java-options "-Xmx90G" ValidateSamFile -IGNORE INVALID_TAG_NM -I $FIXED_BAM -R $REPEAT_MASKED_ASSEMBLY -M SUMMARY
The above command returned the following error: ERROR::MATE_NOT_FOUND
I attempted to solve it by using:
gatk --java-options "-Xmx90G" FixMateInformation -I ${WORKDIR}/ALIGN/fixed_group.bam -O ${WORKDIR}/ALIGN/fixed_group_mate.bam
samtools index ${WORKDIR}/ALIGN/fixed_group_mate.bam
However, running again ValidateSamFile on the fixed BAM file with this command:
gatk --java-options "-Xmx90G" ValidateSamFile -IGNORE INVALID_TAG_NM -I $FIXED_BAM -R $REPEAT_MASKED_ASSEMBLY -M SUMMARY
still outputs the same error:
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_NOT_FOUND 1179845
Tool returned:
3
I am really lost about how getting this solved...
Thanks in advance for your help,
Simon
-
Hello Simon Hellemans,
- Are you running ValidateSamFile in MODE=VERBOSE?
- When was the first time you found the issue of the MATE_NOT_FOUND? Did it exist the first time you ran ValidateSamFile?
-
Dear Genevieve,
Thank you for your reply.
I was indeed running it in VERBOSE mode the first time, so I did not see this error initially.
I just ran ValidateSam on the unfixed BAM in SUMMARY mode, and here are the results:
gatk --java-options "-Xmx90G" ValidateSamFile -I $BAM -R $REPEAT_MASKED_ASSEMBLY -M SUMMARY## HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_TAG_NM 5251770
ERROR:MATE_NOT_FOUND 1179845
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 250053764Tool returned:
2 -
Hi Simon Hellemans, I see, thank you for clarifying that. It looks like there is a problem with the mates in your file and FixMateInformation might not be able to fix it.
You could have lost the mates at one point in your analysis by subsetting a file by intervals or downsampling with a tool that was not mate aware. If so, you should go back and fix these steps.
Here is another post on the GATK forum where someone troubleshooted this issue, which might be helpful to you: https://gatk.broadinstitute.org/hc/en-us/community/posts/360060977512-GATK-Picard-does-not-detect-mates-in-paired-end-BAM
-
Dear Genevieve,I did not subsample the reads prior aligning them.Also, I just confirmed again using fastp that reads are indeed all mated. Here is the corresponding extract of the fastp output:Read1 after filtering:total reads: 117697682Read2 after filtering:total reads: 117697682As per data structure concerning the fix suggested in the link, it would seem this does not apply here.Numbers 1 and 2 indeed appear in fastq reads:### R1@K00308:50:HL7KVBBXX:7:1101:2828:1226 1:N:0:ATTACTCG+AGGCTATACATACAACAAAGATAGTAGAAAATATACTTAGAAGAAGGACTGAAAGAAAAATTGAGGATGTACTTGGAGAAGGTCAGTTTGGATTTAGAAGAGGAAAAGGAACTAGAGATGCGATTGGGATGATGAGAATAATAGCAGAACAAACTTTGG+AAFFFFJJFFJJJJJA7FFJJFJJAJJJJJJJ-<-FJJFAF-7FFJJJ-FFFJJJFJJF7FJFJAJFJJJJJJFJF<A<<-AJJ7FJFJJJJ<JFJFJF<AJFF7JJJAJFFFA<--77<F7F77FAF---FFJFFA<JJJ--FFJAF7A<### R2@K00308:50:HL7KVBBXX:7:1101:2828:1226 2:N:0:ATTACTCG+AGGCTATACCAATCAATACCACTTATCTTAAGGATCTGCATTAATTTGGTCCAGTTTACACGGTCAAATGCCTTCTGCGAGTCTATGAAGCAAATACACAGTTCTTCACCCATCTCCAAAGTTTGTTCTGCTATTATTCTCACCATCCCAATCGCACCT+AAFAFJF-FJJJJFAJFJJF-FJFFAJFJFAFFFJF7AF-F7<AJFJAJJJJJFJFJAJ-J<AFJJJ-FFAFFJJJFJFJFFJJJ<77AAFAA<AJJJFF---7-A7FAJJ7-FFF-7FFFFJAA<FFJ-FAJA-)7-<77F<7A-<7)-)While it is not visible in the SAM (just for clarity purpose: please note that reads above are not the ones displayed below):K00308:50:HL7KVBBXX:8:2228:3823:48913 163 NODE_55712_length_5801_cov_34.588826 1774 60 150M = 2073 450 NCCTCTCTATGTTAACCAATGTTAACCCACACGTATTAAGCGAATTTCACAAGCATTTTATGGCACTAATATTATAATTATAATTCACCTTTCATAACTAACATTATTTTATTTCCACCGTATGGGTTAATTGTAGAATTTTCGGAAGTT #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFFJJJJFFJJ<JF<FJF NM:i:1 MD:Z:0A149 AS:i:149 XS:i:23Is my understanding correct?Best regards,Simon
-
Simon Hellemans how many total reads do you have? Are all of your reads without mates (1179845) or just some?
I am wondering if one of your pre-processing steps had an error that you did not notice and so many of your reads did not align properly. Could you re-run your pre-processing and confirm that there were no issues? Please also test that this issue persists without using RepeatMasker, since I don't know about that tool, I cannot determine if it is causing these issues.
-
Dear Genevieve,
Thanks you for your reply.
As mentioned in my previous post, R1 and R2 both contain 117697682 reads.
As suggested, I re-ran everything without repeat-masking the genome.
At first, I still obtained the ERROR:MATE_NOT_FOUND.
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_NOT_FOUND 89302
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 236359854
Tool returned:
2The above was obtained by converting the SAM to BAM using samtools view -F 4 -S -b as previously. Secondly, I converted instead the SAM to BAM without keeping only mapped, using samtools view -S -b.
This resulted in the ERROR:MATE_NOT_FOUND not occurring!
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 236568066
Tool returned:
2By using AddOrReplaceReadGroups, it resulted in no more errors or warnings appearing, so that is great!
I tried the same approach with the repeat-masked version of the genome I was previously reporting but unfortunately ERROR:MATE_NOT_FOUND still occurs.
I guess having the genome not repeat-masked at this step would be ok, and that I should try to implement that later after variant calling?
Best regards,
Simon -
Hi Simon Hellemans, I am not familiar with the repeat masker tool so I am not sure when you should use it in your analysis.
In terms of GATK, you should see the issue resolved when keeping unmapped reads because anytime you subset the file (such as only keeping mapped reads) you are going to lose some mates. Most tools can handle missing mates but it does indicate something weird happened in your file.
Please sign in to leave a comment.
7 comments