Can't explain the number of reads FilterSamReads outputs
Hello,
I'm attempting to subset several bam files to a specific number of reads for downstream comparisons, but when I filter my bam files with FilterSamReads I end up with a number of reads that I can't quite explain.
I'm using the `shuf` linux command to randomly sample 1 million read names from the sorted and indexed paired-end bam files:
samtools view myfile.bam | awk '{print $1}' - | shuf -n 1000000 - > myfile.bam.QNAMES
I'm then using jdk1.8.0_121 to run the FilterSamReads command from Picard version 2.24.0 like so:
java -Xmx40g -jar /picard-tools-2.24.0/picard.jar FilterSamReads \
    I=myfile.bam \
    O=myfile_subset.bam \
    READ_LIST_FILE=myfile.bam.QNAMES \
    FILTER=includeReadList
All my output files from this have around 1.8 million reads, counted with
samtools view -c myfile_subset.bam
At first I thought FilterSamReads must be pulling out 90% of the paired mapped reads, and that I was about 200K reads short of 2 million because roughly 10% of reads did not have a mapped mate. However, when I look at the number of reads with an unmapped mate using `samtools view -c -f8 myfile_subset.bam`, the number is close (~15%) but doesn't quite explain the total number of reads I'm seeing in the myfile_subset.bam file.
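For reference, the ~15% figure comes from comparing the two counts directly, roughly like this:
samtools view -c myfile_subset.bam       # total reads in the subset
samtools view -c -f8 myfile_subset.bam   # reads whose mate is unmapped (SAM flag 0x8)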
What criteria is FilterSamReads with the includeReadList filter actually imposing to give me the number of reads I'm seeing?
-
Hi Aki Jarl Laruson,
Could you see how many unique read names are in your output bam (myfile_subset.bam) and your read list file (myfile.bam.QNAMES)?
Thank you,
Genevieve
-
Hi Genevieve,
For one example myfile.bam, the number of unique read names in the subset output is
> samtools view myfile_subset.bam | awk '{print $1}' - | uniq | wc -l
1087459
> samtools view -c myfile_subset.bam
1854614
and the number of unique read names in the myfile.bam.QNAMES file is
> uniq myfile.bam.QNAMES | wc -l
99999
Thanks!
-
It looks like there are only 99,999 unique read names in the read list file? Is that true or is it a typo?
-
Oops, that got cut off. That should read 999,999.
-
Fixed the QNAME file occasionally grabbing duplicate read names with
samtools view myfile.bam | awk '{print $1}' - | uniq | shuf -n 1000000 - > myfile.bam.QNAMES
but I don't believe that's the source of the downstream discrepancy.
A QNAME file with exactly 1,000,000 unique read names still produces the same ~1.8 million read subset file.
-
Thanks for the information Aki Jarl Laruson! I followed up with the developer team to take a closer look at what might be going on here.
There could be an issue with how you are finding the unique read names, because uniq only removes sequential (adjacent) repeats. Could you check the unique read names in myfile_subset.bam again using sort -u instead of uniq?
The likely cause is that the original bam contains duplicate read names, so the same read name shows up more than once due to pairs and fragments.
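For example, something like this (using myfile.bam as in your original post) should show how many times each read name appears:
# show the most-repeated read names and how often each occurs
samtools view myfile.bam | awk '{print $1}' - | sort | uniq -c | sort -rn | head
Counts greater than 1 would confirm that the same read name occurs on multiple records (e.g. both mates of a pair).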
Let me know what you find and we will continue to look into it!
-
Thank you! I completely missed that `uniq` was only considering "adjacent" matching lines!
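A quick illustration of the difference, just as a sanity check:
printf 'a\nb\na\n' | uniq | wc -l     # prints 3: the repeated 'a' is not adjacent, so uniq keeps it
printf 'a\nb\na\n' | sort -u | wc -l  # prints 2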
I re-ran this check with sort -u and got the following:
samtools view myfile_subset.bam | awk '{print $1}' - | sort -u | wc -l
927307
So quite a bit fewer unique read names. But then, re-checking my QNAMES file, I see that there's far more redundancy in it than I realized:
sort -u myfile.bam.QNAMES | wc -l
481827
That's over 50% non-unique!
So I re-ran everything using `sort -u` and now get a myfile_subset.bam file with exactly 2M reads according to `samtools view -c` and exactly 1M unique read names.
So it looks like FilterSamReads is pulling the proper number of paired reads (and apparently only paired reads); I just wasn't correctly generating unique read names for the QNAMES file.
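For anyone who finds this thread later, the corrected QNAME-generation step should look something like this (same file names as above, with sort -u in place of uniq before sampling):
samtools view myfile.bam | awk '{print $1}' - | sort -u | shuf -n 1000000 - > myfile.bam.QNAMES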
Thank you so much for your help!
-
Great, glad we were able to get to the bottom of this! And thanks for posting your conclusion so users in the future who come across this issue will know how to solve it.