Explanation of the HS Penalty metric
REQUIRED for all errors and issues:
a) GATK version used:
 4.2.0.0
b) Exact command used:
 /usr/bin/java -Xmx2G -jar /opt/tools/picard.jar CollectHsMetrics I=235109_S6.bam O=metrics_235109_S6.txt NEAR_DISTANCE=150 TARGET_INTERVALS=/data/db/Localisations/interval_list_files/ExomeTwist/TwistRefSeq.interval_list BAIT_INTERVALS=/data/db/Localisations/interval_list_files/ExomeTwist/TwistRefSeqBAIT/TwistRefSeqBAIT.baits.interval_list Q=10 MQ=40 R=/data/db/References/HG38/HG38.fasta
Background: We do WES sequencing. Several of our samples have a strange profile with very noisy variant calls, which makes it difficult to read the results.
We are trying to understand what could have happened to these samples. After verification, these samples present a high rate of GC with a higher average coverage than the other samples of the same run. In addition, they really stand out on the HS Penalty metrics (calculated with CollectHSMetrics), which leads us to believe that this metric may be the key to identifying the source of the problem.
However, the definition of this metric remains rather vague for us. In the Picard documentation (https://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics) "HS Penalty" is defined as:
"The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X."
But this definition does not tell us much. Could you tell us how you define "hybrid selection penalty" and to which manipulation steps you relate this metric?
Thank you for your help
Pierre

[EDIT: This answer is slightly incorrect. Please see my fixed answer in my next post.]
Hi Pierre,
Generally, the HS metrics are a set of statistics for targeted sequencing (like WES). You can find some more details here. The metric you asked about measures how even your coverage was among the bases you saw in your targeted regions. For example, suppose you had a 1000 base window you were targeting, and you saw 500 of those bases at depth 10X, 300 of them at 2X, 100 of them at 1X and 100 of them at 0X.
You can ask: how much more would I have to sequence to get >80% of the covered bases to my target of 6X? Some quick math suggests you need to sequence 3 times as much, so the 300 bases at 2X would jump up to around 6X, bringing 500 + 300 = 800 bases (out of 900 "seen") to 6X. This is a good estimate, but GATK does something a bit more clever: it also incorporates an estimated duplication rate from your data, so when you sequence 3x as much, it models how much of that will be duplicates and produces a correspondingly higher (more realistic) estimate. This number becomes a higher "penalty" if you need to sequence a lot more to reach the coverage goal, and smaller if you're already close to the goal (e.g. 10X, 20X, etc. for the different metrics produced).
In short, seeing this number much higher than expected suggests that a large fraction of your target regions sit at lower-than-normal depth. This could be caused by a number of different factors, but hopefully this helps you investigate the cause further.

Dear Ricky,
Thank you for your clear answer! It confirms the way I understood the meaning of these metrics.
However I have made an observation I cannot explain which makes me feel that I did not fully understand something.
One of the libraries exhibits a PCT_TARGET_BASES_10X of 0.9361, which I interpret as "93.61% of the bases from my design are covered at least at 10X". Therefore, I would expect to see a HS_PENALTY_10X below 1.0 (i.e. I do not need to sequence deeper to cover 80% of my design at 10X because more than 90% is already covered at this depth). However, for this library, the value of HS_PENALTY_10X is 5.806996.
I have made the same observation at 20, 30 and 40X. Indeed, the corresponding HS_PENALTY values are above 1.0 (5.86, 5.91 and 5.95, respectively) while more than 80% of bases are covered at these depths (92.89, 90.88 and 85.59%, respectively).
The unique read rate also seems quite high (as usual) for this library (PCT_PF_UQ_READS_ALIGNED = 0.998488). Therefore I am not sure that the model's estimate of the duplicate rate could impact the HS_PENALTY values this much.
Could you please provide an explanation or a direction to dig about this point?
Many thanks for your kind help.
Pierre 
Hi Pierre,
Thanks for your follow up question. I spoke with a colleague about this and think my explanation for HS_PENALTY_10X was slightly off, which will help explain the confusion here. The HS_PENALTY_10X is a multiplier relative to uniform coverage, not the actual coverage in your sample. I'll modify the previous example here (and edit my old answer to point here):

Suppose you have a sample as before: a 1000 base window you were targeting, where you saw 500 bases at depth 10X, 300 at 2X, 100 at 1X and 100 at 0X. You want 80% of the visible ("possible") bases in this window to be covered at 6X. In an ideal world, you would sequence to 6X and get perfectly uniform coverage. In that perfect case, the "HS_PENALTY_6X" would just be 1: you don't need any multiplier applied to a plain 6X to get 6X coverage on 80% of reasonable bases.
As the actual distribution of coverages in this example shows, the reality is messier: the mean coverage is (1/2)*10 + (3/10)*2 + (1/10)*1 + (1/10)*0 = 5.7, already close to 6X, but only around 55% of "seen" bases are at least 6X. So we'll have to sequence to higher depth than 6X in this region to meet the desired criteria. How much more? That's what the HS_PENALTY_6X should show: the ratio between these two.
In this case, we can approximate it by noting that bringing our 300 bases at 2X up to 6X is the most efficient way to reach 80% at 6X. Since we can't choose where the new sequencing lands within the region, we have to assume the 5.7X mean we put in will shake out to around 2X in this 300-base subregion, so we would need 3 * 5.7 = 17.1X overall to get up to 80% at 6X. As a multiplier, we get HS_PENALTY_6X = 17.1 / 6 = 2.85, so we have to sequence 2.85x more than the "expected" 6X.
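That back-of-envelope reasoning can be sketched in a few lines of Python. To be clear, this is just the approximation from this example, not Picard's actual algorithm, and `hs_penalty_estimate` is a made-up name for illustration:

```python
# Sketch of the penalty approximation above (NOT Picard's real code):
# scale current coverage until 80% of "seen" bases reach the target depth,
# then express the implied mean coverage as a multiple of that depth.
def hs_penalty_estimate(hist, target_depth, pct=0.80):
    """hist maps depth -> number of bases observed at that depth."""
    total = sum(hist.values())
    seen = total - hist.get(0, 0)          # bases covered at least once
    mean_cov = sum(d * n for d, n in hist.items()) / total
    needed = pct * seen                    # bases that must reach target_depth

    # Walk strata from deepest to shallowest until we have enough bases;
    # the shallowest stratum we include sets the required scale-up.
    acc = 0
    for depth in sorted((d for d in hist if d > 0), reverse=True):
        acc += hist[depth]
        if acc >= needed:
            scale = max(1.0, target_depth / depth)
            required_mean = scale * mean_cov
            return required_mean, required_mean / target_depth
    return float("inf"), float("inf")      # goal unreachable from seen bases

req, penalty = hs_penalty_estimate({10: 500, 2: 300, 1: 100, 0: 100}, 6)
print(req, penalty)  # ~17.1X required, penalty ~2.85
```

Running this on the example histogram reproduces the 17.1X / 2.85 figures worked out above.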
Of course, this is still oversimplified: we didn't account for duplicates arising from sequencing! (Side note: your PCT_PF_UQ_READS_ALIGNED metric also has less to do with duplication rates than you might expect. It tracks how many of your unique reads aligned to the reference, i.e. alignment rather than uniqueness.) The real HS_PENALTY_NX metrics also do a proper duplication estimation to scale the multiplier accordingly. If you'd like to see the code itself for this, you can find it here.
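As a purely hypothetical illustration of why duplicates push the multiplier up (this is not Picard's actual duplication model, and the 10% rate is made up): if a fraction `dup_rate` of the additional sequencing will be duplicates, the idealized multiplier grows by roughly 1 / (1 - dup_rate):

```python
# Hypothetical sketch: inflate an idealized penalty for an assumed fraction
# of duplicate bases among new sequencing. Picard estimates the duplication
# rate from the data; the 0.10 here is invented for illustration.
def penalty_with_dups(ideal_penalty, dup_rate):
    return ideal_penalty / (1.0 - dup_rate)

print(penalty_with_dups(2.85, 0.10))  # a 10% dup rate lifts 2.85 to ~3.17
```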

To return to your question about the Hs Penalty vs Target Bases metrics, we can now see how to reconcile the two. Your HS_PENALTY_10X being around 5.8 means that you needed around 58X (ignoring duplicates) to cover 80% of your targeted bases at 10X (rather than the overly optimistic 10X ideal). Extrapolating from the example above, this suggests quite a wide variance in your coverage. Fortunately, since PCT_TARGET_BASES_10X is 0.93, you actually achieved 10X on 93% of your target bases. From these alone I'd infer your sample was probably sequenced to a depth well above 58X to achieve that. In some sense this says the >58X coverage really was "worth it" to get 80% up to 10X.
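A quick sanity check of that multiplier reading, using the penalty values reported earlier in the thread (and still ignoring duplicates):

```python
# Target depth times its HS_PENALTY gives the rough effective sequencing
# depth needed to put 80% of target bases at that depth.
penalties = {10: 5.806996, 20: 5.86, 30: 5.91, 40: 5.95}
for depth, pen in penalties.items():
    print(f"{depth}X target -> ~{depth * pen:.0f}X effective depth needed")
```

All four products land in the same ~5.8-6x neighborhood (about 58X, 117X, 177X, 238X), which is consistent with one underlying coverage-unevenness profile rather than four unrelated anomalies.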
Hope this helps! Sorry about the earlier confusion with my first explanation. Let me know if you have further questions.