Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

FixMateInformation does not solve ERROR:MATE_NOT_FOUND

7 comments

  • Genevieve Brandt (she/her)

    Hello Simon Hellemans,

    1. Are you running ValidateSamFile in MODE=VERBOSE?
    2. When was the first time you found the issue of the MATE_NOT_FOUND? Did it exist the first time you ran ValidateSamFile?
  • Simon Hellemans

    Dear Genevieve,

    Thank you for your reply.

    I was indeed running it in VERBOSE mode the first time, so I did not see this error initially.

    I just ran ValidateSamFile on the unfixed BAM in SUMMARY mode, and here are the results:
    gatk --java-options "-Xmx90G" ValidateSamFile -I $BAM -R $REPEAT_MASKED_ASSEMBLY -M SUMMARY

    ## HISTOGRAM    java.lang.String
    Error Type      Count
    ERROR:INVALID_TAG_NM    5251770
    ERROR:MATE_NOT_FOUND    1179845
    ERROR:MISSING_READ_GROUP        1
    WARNING:RECORD_MISSING_READ_GROUP       250053764

    Tool returned:
    2

  • Genevieve Brandt (she/her)

    Hi Simon Hellemans, I see, thank you for clarifying that. It looks like there is a problem with the mates in your file and FixMateInformation might not be able to fix it. 

    You could have lost the mates at one point in your analysis by subsetting a file by intervals or downsampling with a tool that was not mate aware. If so, you should go back and fix these steps.

    Here is another post on the GATK forum where someone troubleshot this issue, which might be helpful to you: https://gatk.broadinstitute.org/hc/en-us/community/posts/360060977512-GATK-Picard-does-not-detect-mates-in-paired-end-BAM
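
One way to check for lost mates independently of ValidateSamFile is to count, per read name, which ends of the pair are present among the primary records. The sketch below is only an approximation of that idea and is not how Picard implements the check; the function name `find_orphans` and the demo read names are made up for illustration.

```python
from collections import defaultdict

# SAM FLAG bits used here (values from the SAM specification).
PAIRED, FIRST, SECOND = 0x1, 0x40, 0x80
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def find_orphans(sam_lines):
    """Return names of paired reads for which only one end is present.

    Secondary and supplementary records are ignored, so only primary
    records count towards "both ends seen".
    """
    ends_seen = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if not flag & PAIRED or flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if flag & FIRST:
            ends_seen[name].add(1)
        if flag & SECOND:
            ends_seen[name].add(2)
    return sorted(name for name, ends in ends_seen.items() if ends != {1, 2})


# Hypothetical records: pairA is complete, pairB has lost its second end.
demo = [
    "@HD\tVN:1.6\tSO:coordinate",
    "\t".join(["pairA", "99", "ctg1", "100", "60", "5M", "=", "300", "205", "ACGTA", "FFFFF"]),
    "\t".join(["pairA", "147", "ctg1", "300", "60", "5M", "=", "100", "-205", "ACGTA", "FFFFF"]),
    "\t".join(["pairB", "73", "ctg1", "500", "60", "5M", "=", "500", "0", "ACGTA", "FFFFF"]),
]
print(find_orphans(demo))
```

On a real file the same function could read SAM text from standard input (for example the output of `samtools view -h`), but it keeps every read name in memory, which may be impractical for hundreds of millions of records.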

  • Simon Hellemans
    Dear Genevieve,
     
    I did not subsample the reads prior to aligning them.
     
    Also, I just confirmed again using fastp that reads are indeed all mated. Here is the corresponding extract of the fastp output:
    Read1 after filtering:
    total reads: 117697682
    Read2 after filtering:
    total reads: 117697682
     
    Regarding the fix suggested in the linked post: judging by the structure of my data, it does not seem to apply here.
     
    The numbers 1 and 2 do appear in the FASTQ read headers:
    ### R1
    @K00308:50:HL7KVBBXX:7:1101:2828:1226 1:N:0:ATTACTCG+AGGCTATA
    CATACAACAAAGATAGTAGAAAATATACTTAGAAGAAGGACTGAAAGAAAAATTGAGGATGTACTTGGAGAAGGTCAGTTTGGATTTAGAAGAGGAAAAGGAACTAGAGATGCGATTGGGATGATGAGAATAATAGCAGAACAAACTTTGG
    +
    AAFFFFJJFFJJJJJA7FFJJFJJAJJJJJJJ-<-FJJFAF-7FFJJJ-FFFJJJFJJF7FJFJAJFJJJJJJFJF<A<<-AJJ7FJFJJJJ<JFJFJF<AJFF7JJJAJFFFA<--77<F7F77FAF---FFJFFA<JJJ--FFJAF7A<
     
    ### R2
    @K00308:50:HL7KVBBXX:7:1101:2828:1226 2:N:0:ATTACTCG+AGGCTATA
    CCAATCAATACCACTTATCTTAAGGATCTGCATTAATTTGGTCCAGTTTACACGGTCAAATGCCTTCTGCGAGTCTATGAAGCAAATACACAGTTCTTCACCCATCTCCAAAGTTTGTTCTGCTATTATTCTCACCATCCCAATCGCACCT
    +
    AAFAFJF-FJJJJFAJFJJF-FJFFAJFJFAFFFJF7AF-F7<AJFJAJJJJJFJFJAJ-J<AFJJJ-FFAFFJJJFJFJFFJJJ<77AAFAA<AJJJFF---7-A7FAJJ7-FFF-7FFFFJAA<FFJ-FAJA-)7-<77F<7A-<7)-)
     
    They are not visible in the SAM, however (for clarity: please note that the reads above are not the one displayed below):
    K00308:50:HL7KVBBXX:8:2228:3823:48913 163 NODE_55712_length_5801_cov_34.588826 1774 60 150M = 2073 450 NCCTCTCTATGTTAACCAATGTTAACCCACACGTATTAAGCGAATTTCACAAGCATTTTATGGCACTAATATTATAATTATAATTCACCTTTCATAACTAACATTATTTTATTTCCACCGTATGGGTTAATTGTAGAATTTTCGGAAGTT #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFFJJJJFFJJ<JF<FJF NM:i:1 MD:Z:0A149 AS:i:149 XS:i:23
     
     
    Is my understanding correct?
     
    Best regards,
    Simon
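
For reference, the SAM format records pair membership in the FLAG field (the second column) rather than in the read name, which is why the 1 and 2 from the FASTQ headers are not visible there. The record quoted above has FLAG 163. A minimal sketch for decoding it, using the bit meanings from the SAM specification (the function name `decode_flag` is just illustrative):

```python
# Bit values and meanings of the SAM FLAG field.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


print(decode_flag(163))
# ['paired', 'proper pair', 'mate reverse strand', 'second in pair']
```

So the quoted record is the second read of a properly aligned pair, and its mate is expected on the same contig at position 2073 (the `=` and `2073` columns).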
  • Genevieve Brandt (she/her)

    Simon Hellemans, how many total reads do you have? Are all of your reads without mates (1179845) or just some?

    I am wondering if one of your pre-processing steps had an error that you did not notice, so that many of your reads did not align properly. Could you re-run your pre-processing and confirm that there were no issues? Please also test whether this issue persists without using RepeatMasker. Since I am not familiar with that tool, I cannot determine whether it is causing these issues.
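
A rough answer to the "all or just some" question can be read off the SUMMARY histogram posted earlier in the thread. The sketch below parses that text and divides the MATE_NOT_FOUND count by the RECORD_MISSING_READ_GROUP count. This rests on an assumption: that no record in the file carries a read group, so the warning count approximates the total number of records.

```python
# SUMMARY output copied from the earlier post in this thread.
SUMMARY = """\
## HISTOGRAM    java.lang.String
Error Type      Count
ERROR:INVALID_TAG_NM    5251770
ERROR:MATE_NOT_FOUND    1179845
ERROR:MISSING_READ_GROUP        1
WARNING:RECORD_MISSING_READ_GROUP       250053764
"""


def parse_summary(text):
    """Map each ERROR:/WARNING: type in the histogram to its count."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(("ERROR:", "WARNING:")):
            counts[parts[0]] = int(parts[1])
    return counts


counts = parse_summary(SUMMARY)
# Assumption: every record lacks a read group, so this is ~ the record count.
total_records = counts["WARNING:RECORD_MISSING_READ_GROUP"]
orphan_fraction = counts["ERROR:MATE_NOT_FOUND"] / total_records
print(f"{orphan_fraction:.2%} of records lack a mate")
# prints: 0.47% of records lack a mate
```

Under that assumption, well under 1% of the records are affected, so only some reads have lost their mates.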

  • Simon Hellemans

    Dear Genevieve,

    Thank you for your reply.

    As mentioned in my previous post, R1 and R2 both contain 117697682 reads.

    As suggested, I re-ran everything without repeat-masking the genome.

    At first, I still obtained the ERROR:MATE_NOT_FOUND.
    ## HISTOGRAM    java.lang.String
    Error Type      Count
    ERROR:MATE_NOT_FOUND    89302
    ERROR:MISSING_READ_GROUP        1
    WARNING:RECORD_MISSING_READ_GROUP       236359854
    Tool returned:
    2

    The above was obtained by converting the SAM to BAM using samtools view -F 4 -S -b, as previously. I then converted the SAM to BAM without restricting the output to mapped reads, using samtools view -S -b.

    With this, ERROR:MATE_NOT_FOUND no longer occurred!

    ## HISTOGRAM    java.lang.String
    Error Type      Count
    ERROR:MISSING_READ_GROUP        1
    WARNING:RECORD_MISSING_READ_GROUP       236568066
    Tool returned:
    2
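
The difference between the two conversions can be illustrated with a small simulation of the flag filter. Excluding records that have the "read unmapped" bit (0x4) set removes an unmapped read but keeps its mapped mate, which then has no partner in the file. The sketch below uses hypothetical read names and a simplified record of (name, flag); `exclude_any` mimics the "drop if any of these bits is set" behaviour of `-F`.

```python
from collections import Counter

UNMAPPED, MATE_UNMAPPED = 0x4, 0x8


def exclude_any(records, mask):
    """Keep only records with none of the bits in `mask` set."""
    return [(name, flag) for name, flag in records if not flag & mask]


def orphans(records):
    """Names that occur exactly once, i.e. whose mate is absent."""
    seen = Counter(name for name, _ in records)
    return sorted(name for name, n in seen.items() if n == 1)


# Hypothetical primary records as (read name, FLAG).
records = [
    ("pairA", 99), ("pairA", 147),  # both ends mapped
    ("pairB", 73), ("pairB", 133),  # first end mapped, second end unmapped
    ("pairC", 77), ("pairC", 141),  # both ends unmapped
]

print(orphans(records))                                         # no filtering
print(orphans(exclude_any(records, UNMAPPED)))                  # like -F 4
print(orphans(exclude_any(records, UNMAPPED | MATE_UNMAPPED)))  # like -F 12
```

In this toy data only the 0x4 filter leaves an orphan (pairB). Excluding on both bits removes whole pairs instead, at the cost of also discarding mapped reads whose mate is unmapped. For the real files, the two histograms differ by 236568066 - 236359854 = 208212 records, which would be the number removed by the filter if the warning count equals the record count (an assumption, since no read groups were present at that point).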

    After running AddOrReplaceReadGroups, no more errors or warnings appeared, so that is great!

    I tried the same approach with the repeat-masked version of the genome that I reported on previously, but unfortunately ERROR:MATE_NOT_FOUND still occurs.
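
As far as I understand the two read group messages, MISSING_READ_GROUP refers to the header having no @RG line, while RECORD_MISSING_READ_GROUP is counted once per record without an RG tag, which would explain why one fix clears both. The sketch below checks SAM text for both conditions; the function name and the demo records are hypothetical, and the interpretation of the messages is an assumption rather than something confirmed in this thread.

```python
def read_group_report(sam_lines):
    """Count @RG header lines, records, and records without an RG:Z: tag."""
    report = {"header_read_groups": 0, "records": 0, "records_without_rg": 0}
    for line in sam_lines:
        if line.startswith("@"):
            if line.startswith("@RG\t"):
                report["header_read_groups"] += 1
            continue
        report["records"] += 1
        # Optional tags start after the 11 mandatory SAM columns.
        tags = line.rstrip("\n").split("\t")[11:]
        if not any(tag.startswith("RG:Z:") for tag in tags):
            report["records_without_rg"] += 1
    return report


# Hypothetical SAM text before and after adding a read group.
core = ["read1", "0", "ctg1", "100", "60", "5M", "*", "0", "0", "ACGTA", "FFFFF"]
before = ["@HD\tVN:1.6", "\t".join(core + ["NM:i:0"])]
after = ["@HD\tVN:1.6", "@RG\tID:grp1\tSM:sample1", "\t".join(core + ["NM:i:0", "RG:Z:grp1"])]

print(read_group_report(before))
print(read_group_report(after))
```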

    I guess it would be OK to use the genome without repeat masking at this step, and that I should try to apply the masking later, after variant calling?

    Best regards,
    Simon

  • Genevieve Brandt (she/her)

    Hi Simon Hellemans, I am not familiar with the repeat masker tool so I am not sure when you should use it in your analysis.

    In terms of GATK, you should see the issue resolved when keeping unmapped reads, because any time you subset the file (such as by keeping only mapped reads) you are going to lose some mates. Most tools can handle missing mates, but it does indicate that something weird happened in your file.

