CollectReadCounts tutorial incorrectly states default read quality threshold
Answered
If you are seeing an error, please provide (REQUIRED):
a) GATK version used: 4.1.4.1
b) Exact command used:
gatk CollectReadCounts \
--minimum-mapping-quality $MQmin \
-L $binnedIntervalsPadded \
-R $refGen -imr OVERLAPPING_ONLY \
-I ${bamList[${i}]} --format TSV \
-O $tsvFolderFinalPath/${tsvList[${i}]}
Performed on targeted WGS data; padding 0 bp, intervals 300 bp.
The GATK documentation says that CollectReadCounts uses a default mapping-quality (MQ) cutoff of 10.
However, I have extensively tested the filters at this step because the counts I am getting do not match the IGV visualization.
Note: I know counts are defined as the number of read start sites in a window.
I turned off every filter except the quality filter and am still getting a count of 1 in a window identified as a double deletion by the gCNV pipeline. The actual count is 45, obtained by manually counting read start sites in the given interval in IGV with a quality filter of 10.
I finally identified the quality filter as the cause and set to experiment with thresholds:
MQmin=0: count = 46
MQmin=5: count = 45
MQmin=10: count = 45
MQmin=30: count = 1
MQmin=36: count = 0
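For anyone wanting to reproduce this kind of sweep, a minimal sketch of the loop I used is below; it reuses the variable names from my command above, while `sample.bam` and the output file names are placeholders:

```shell
# Sweep --minimum-mapping-quality and write one counts TSV per threshold.
# $refGen and $binnedIntervalsPadded are set as in the command above;
# sample.bam is a placeholder for one input BAM.
for MQmin in 0 5 10 30 36; do
  gatk CollectReadCounts \
    --minimum-mapping-quality $MQmin \
    -L $binnedIntervalsPadded \
    -R $refGen -imr OVERLAPPING_ONLY \
    -I sample.bam --format TSV \
    -O counts_MQ${MQmin}.tsv
done
```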
I feel silly now for wasting so much time trusting that the defaults were what the documentation said they were, which led me to look everywhere but the quality filter for the cause of my discrepancy.
So, firstly: there is an error here somewhere. Either the tutorial documentation is wrong in saying the default is 10 when it is actually 30, or the code's default is wrong, set to 30 when it should be 10.
And secondly (if the default of 30 is correct): why is the default GATK mapping-quality filter so high for the purposes of CNV analysis, and should I be using a different threshold when identifying double deletions? The current default seems absurd, but I would like some feedback from long-time users of the tool.
best regards,
Simon
-
Hi Simon,
We took a closer look at this tutorial and it seems that you are correct and there is a typo in the tutorial:
This means the tool excludes reads marked as duplicate and excludes reads with mapping quality less than 10.
The default for the MappingQualityReadFilter is 30. I'll have our documentation team fix this issue, thank you for pointing it out! I'm sorry that it caused you such a headache looking into this problem. Sometimes our tutorial documentation becomes out of date, so feel free to reach out to us if you think something is strange.
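Incidentally, one way to double-check a tool's actual defaults (rather than relying on tutorial text) is the tool's own help output, which lists each argument together with its default value:

```shell
# Print CollectReadCounts arguments with their current default values;
# --minimum-mapping-quality appears here with the default applied by
# the MappingQualityReadFilter.
gatk CollectReadCounts --help
```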
In terms of why the value is set at 30, my colleague has a note about that:
This is specifically set for both the Germline and Somatic CNV pipelines, which are optimized for WES samples, where higher-quality data translates to more accurate calls. If the user needs to increase sensitivity in regions of low mappability, they can simply turn that parameter down in their runs.
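In other words, restoring the behavior the tutorial described is just a matter of setting the flag explicitly, as in this sketch (reference, interval, and file paths are placeholders):

```shell
# Lower the mapping-quality cutoff from the default of 30 to 10 to
# increase sensitivity, e.g. in low-mappability regions; all input and
# output paths here are placeholders.
gatk CollectReadCounts \
  --minimum-mapping-quality 10 \
  -L intervals.interval_list \
  -R reference.fasta \
  -imr OVERLAPPING_ONLY \
  -I sample.bam --format TSV \
  -O sample.counts.tsv
```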
Please let me know if you have any further questions.
Best,
Genevieve