Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Can't explain the number of reads FilterSamReads outputs

Answered
0

8 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Aki Jarl Laruson,

    Could you see how many unique read names are in your output bam (myfile_subset.bam) and your read list file (myfile.bam.QNAMES)?

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Aki Jarl Laruson

    Hi Genevieve,

    For one example myfile.bam, the number of unique read names is

    > samtools view myfile_subset.bam | awk '{print $1}' - | uniq | wc -l
    1087459
    > samtools view -c myfile_subset.bam
    1854614

    and the number of unique read names in the myfile.bam.QNAMES file is

    > uniq myfile.bam.QNAMES | wc -l
    99999

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    It looks like there are only 99,999 unique read names in in the read list file? Is that true or is it a typo?

    0
    Comment actions Permalink
  • Avatar
    Aki Jarl Laruson

    Oops, that got cut off. That should read 999,999.

    0
    Comment actions Permalink
  • Avatar
    Aki Jarl Laruson

    Fixed the QNAME file occasionally grabbing duplicate read names with

    samtools view myfile.bam | awk '{print $1}' - | uniq | shuf -n 1000000 - > myfile.bam.QNAME

    but I don't believe that's the source of the downstream discrepancy.

    A QNAME file with exactly 1,000,000 unique read names still produces the same ~1.8 million read subset file.

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the information Aki Jarl Laruson! I followed up with the developer team to take a closer look on what might be going on here.

    Potentially there could be an issue with how you are finding the unique read names because uniq only removes sequential repeats. Could you check again the unique read names in myfile_subset.bam using sort -u instead of uniq

    The likely cause of this is if the original bam has duplicate read names so the same read name is popping up due to pairs and fragments. 

    Let me know what you find and we will continue to look into it!

    0
    Comment actions Permalink
  • Avatar
    Aki Jarl Laruson

    Thank you! I completely missed that `uniq` was only considering "adjacent" matching lines!

    I re-ran this check with sort -u and got the following:

    samtools view myfile_subset.bam | awk '{print $1}' - | sort -u | wc -l
    927307

    So quite a bit fewer unique reads, but then also re-checking my QNAMES files I see that there's way more redundancy in my QNAMES file than I realized

    sort -u myfile.bam.QNAMES | wc -l
    481827

    That's over 50% non-unique!

    So I re-ran everything using `sort -u` and now get a myfile_subset.bam file that's exactly 2M reads according to `samtools view -c` with exactly 1M unique read names.

    So looks like FilterSamReads is pulling the proper amount of paired reads (and apparently only paired reads), I just wasn't correctly generating unique read names for the QNAME files.

    Thank you so much for your help!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Great, glad we were able to get to the bottom of this! And thanks for posting your conclusion so users in the future who come across this issue will know how to solve it.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk