GATK CNV interpret Segment_Mean and MEAN_LOG2_COPY_RATIO
Hi GATK Team,
When I use the GATK CNV workflow, I notice that in the output file *.igv.seg, the `Segment_Mean` is centered at 1. In your documentation, I found "The mean log2 copy ratio is given in the SEGMENT_MEAN column." In my results, this column ranges from 0 to Inf, and it doesn't look like a log2 value (there are no negatives). It looks more like the raw ratio of sample coverage to PON coverage, averaged over the region. So I would like to ask:
1. Where does the `log2` take place in segment mean calculation?
2. How and why does the program center the segment mean at 1? I see other tools center the log2 copy ratio at 0, which makes sense to me (log2(2/2) = 0).
3. I noticed that the `MEAN_LOG2_COPY_RATIO` values in `*.called.seg` files are very similar to log2(Segment_Mean) but slightly different. Is there any calibration in the transformation?
Thanks!
-
I just checked the output from a coworker who runs the latest GATK. It seems you guys changed the output of both files to be the same in 4.1.6: both now report the same log2 value. My results were generated with 4.1.0.
-
Hi,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Hi @lzhan140,
Yes, we changed the output of SEGMENT_MEAN, etc. at some point to be in log2 space.
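For anyone comparing files produced before and after that change, here is a minimal Python sketch of the conversion between the two scales. It assumes the older values are plain linear copy ratios (neutral = 1) and the newer ones are log2 copy ratios (neutral = 0); the function names and sample values are hypothetical and are not part of GATK.

```python
import math


def linear_to_log2(copy_ratio: float) -> float:
    """Convert a linear copy ratio (neutral = 1) to log2 space (neutral = 0)."""
    if copy_ratio <= 0:
        raise ValueError("log2 is undefined for non-positive copy ratios")
    return math.log2(copy_ratio)


def log2_to_linear(log2_copy_ratio: float) -> float:
    """Convert a log2 copy ratio (neutral = 0) back to linear space (neutral = 1)."""
    return 2.0 ** log2_copy_ratio


# Hypothetical linear-scale segment means, as an older output might report them.
linear_values = [0.5, 1.0, 1.5]
log2_values = [linear_to_log2(v) for v in linear_values]

for lin, lg in zip(linear_values, log2_values):
    print(f"linear={lin:.3f}  log2={lg:+.3f}")
```

A linear value of 1 maps to a log2 value of 0, which is why a column centered at 1 with no negative values suggests linear rather than log2 space.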
Note that any files with a SEGMENT_MEAN column are in the "legacy" CBS-style seg-file format. This is either because they are intended to be compatible with IGV (e.g., *.igv.seg) for convenience of plotting, or because we wanted to preserve some compatibility with downstream tools or legacy functionality in CallCopyRatioSegments (which was based on an older ReCapSeg caller).
So the quantity that appears in that column may be slightly different, depending on the use. For example, in the *.igv.seg files output by ModelSegments, as documented: "The posterior medians of the log2 copy ratio and minor-allele fraction are given in the SEGMENT_MEAN columns in the .cr.igv.seg and .af.igv.seg files, respectively." So this should line up with the *_POSTERIOR_50 quantities reported in *.modeled.seg (which should be considered the primary output of ModelSegments).
In contrast, the quantity that appears in the SEGMENT_MEAN column in the *.cr.seg file (which is passed to CallCopyRatioSegments) is simply the mean of the log2 copy-ratio data contained in that segment. This is not the same thing as the median of the log2 copy-ratio posterior that is fit by the ModelSegments model (although it will be close). The mean is reported here simply because this quantity is what the ReCapSeg-style caller in CallCopyRatioSegments expects.
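To illustrate why different summaries of the same segment can be close but not identical, here is a small Python sketch on made-up copy ratios. It does not reproduce the ModelSegments posterior: the median of the data is used only as a rough stand-in for "a median-type summary", and all values are hypothetical. It also shows that the log2 of a linear mean differs from the mean of the log2 values; whether that accounts for the discrepancy asked about in question 3 is an assumption, not something confirmed in this thread.

```python
import math
import statistics

# Hypothetical linear copy ratios for the bins in one segment (not real GATK output).
linear_ratios = [0.8, 0.9, 1.0, 1.1, 1.6]
log2_ratios = [math.log2(r) for r in linear_ratios]

# Mean of the log2 data: the kind of quantity described for *.cr.seg.
mean_log2 = statistics.mean(log2_ratios)

# Median of the log2 data: only a stand-in here, NOT the model's posterior median.
median_log2 = statistics.median(log2_ratios)

# log2 of the linear mean: what you get by transforming a linear-scale mean.
log2_of_mean = math.log2(statistics.mean(linear_ratios))

print(f"mean of log2 values:   {mean_log2:+.4f}")
print(f"median of log2 values: {median_log2:+.4f}")
print(f"log2 of linear mean:   {log2_of_mean:+.4f}")
```

For data with any spread, the mean of the log2 values is smaller than the log2 of the linear mean (Jensen's inequality), so the two numbers agree only approximately.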
I know that's a little confusing, and it's always been our intention to replace CallCopyRatioSegments with a better caller and get rid of a lot of these legacy outputs/formats. Unfortunately, we haven't gotten around to that just yet!