Discrepancy between CollectReadCounts and CollectWgsMetrics
REQUIRED for all errors and issues:
a) GATK version used: v4.5.0.0
b) Exact command used:
For CollectWgsMetrics:
Java -jar picard.jar CollectWgsMetrics I=input.bam O=output.txt R=hg38.fasta
For CollectReadCounts:
Java -jar picard.jar CollectReadCounts I=input.bam L=100bp_bin.txt R=hg38.fasta O=sample.counts.tsv --format tsv
For almost all my samples, the average count calculated by CollectReadCounts is around 2/3 of the mean coverage reported by CollectWgsMetrics. I think it might have something to do with the read length, which is 150bp paired-end in my case, while the bins are 100bp. Is this the intended behavior? CollectReadCounts is part of the GATK-SV pipeline, which calculates the median coverage for each sample. Would this difference matter downstream?
Thanks!
Le
-
Hi Le Qi
CollectReadCounts only counts the reads that pass its filters and have their start sites within the interval, so its output is not directly comparable to mean or median coverage.
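As a rough back-of-the-envelope check (a sketch only, assuming uniform coverage, every read counted once by its start position, and identical filtering in both tools, none of which holds exactly in practice), the expected number of read starts per bin is coverage × bin size / read length. The function name below is hypothetical and only for illustration:

```python
def expected_count_per_bin(mean_coverage: float, read_length: int, bin_size: int) -> float:
    """Expected number of read starts in one bin, assuming uniform coverage."""
    # A mean coverage of C with reads of length L implies that, on average,
    # C / L reads start at each base position.
    starts_per_base = mean_coverage / read_length
    # Each bin collects the read starts of bin_size consecutive positions.
    return starts_per_base * bin_size


# Illustrative numbers: 30x coverage, 150bp reads, 100bp bins.
ratio = expected_count_per_bin(30.0, 150, 100) / 30.0
print(f"count / coverage = {ratio:.3f}")
```

With 150bp reads and 100bp bins the ratio is 100/150, which would be consistent with the roughly 2/3 you observed. The two tools also apply different read filters, so the real ratio will not match this exactly.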
I hope this helps.
-
Hi Gökalp Çelik
Thanks for the reply. I understand this is how CollectReadCounts works, so it's intended behavior. The only reason I called it coverage is that it's the first step in generating the medianCov.transposed.bed file in the EvidenceQC module, which is in turn used by several later modules. I'm having trouble figuring out whether only the ratio of coverages across samples in this file matters, or the absolute values as well.