Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Interpretation of gene_summary output from DepthOfCoverage

Answered

11 comments

  • Genevieve Brandt (she/her)

    Hi neethukrishna Kausthubham,

    Thanks for writing into the forum about this issue! Could you check in your RefSeq file to see if the SAMD11 gene is repeated? If you are using a public RefSeq file, let me know which it is so I can take a look.

    Best,

    Genevieve

  • Seunghun Han

    Hi,

    I'm having the same issue. I'm using a RefSeq gene list generated by following this article: https://gatk.broadinstitute.org/hc/en-us/articles/360035532032-RefSeq-gene-list-format
    The RefSeq gene list has multiple records (transcripts) per gene. Even so, DepthOfCoverage in GATK 3.7 generates an aggregate gene-level summary (one line per gene), whereas DepthOfCoverage in the latest GATK now produces multiple records per gene.

  • Genevieve Brandt (she/her)

    Hi Seunghun Han,

    Thanks for writing in to the forum about this! I confirmed with the developer that this behavior is expected. GATK3 produced one aggregate number per gene and did not distinguish between the different transcripts at all. In GATK4, we wanted to make sure that overlapping genes or transcripts would not get merged and would instead be measured individually.

    Would you be able to look through IGV and confirm that the numbers make sense for the transcripts on each line?

    Best,

    Genevieve

  • Seunghun Han

    Hi Genevieve,

    I checked IGV and the gene summary records for multiple transcripts, and the numbers look sensible. I think I will just go ahead and modify the RefSeq gene list so that each gene has a single representative transcript, which should fix the issue above. However, I noticed another behavior that wasn't a problem in GATK 3.7. A few genes have this weird symbol, �, in their "average_coverage", "sampleid_mean_cvg", and "sampleid_%_above_15" columns, as shown in the attached screenshot. It seems to happen only with genes that have no coverage, but most of the other genes with no coverage have 0 instead of � in the same columns. Is this a known bug?

    Best,
    Seunghun
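The plan mentioned above, reducing the RefSeq gene list to a single representative transcript per gene, could be sketched roughly as follows. This is only an illustration, not a GATK-provided method. It assumes a tab-separated, UCSC refGene-style layout (transcript start in column 5, transcript end in column 6, gene symbol in column 13, counting from 1), and it picks the transcript with the longest genomic span, which is just one possible definition of "representative". The function name is hypothetical.

```python
import csv

# Assumed 0-based column positions for a UCSC refGene-style file;
# adjust these if your gene list uses a different layout.
TX_START, TX_END, GENE_NAME = 4, 5, 12


def longest_transcript_per_gene(lines):
    """Return one row per gene, keeping the transcript with the longest span."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and header/comment lines.
        if not row or row[0].startswith("#"):
            continue
        gene = row[GENE_NAME]
        span = int(row[TX_END]) - int(row[TX_START])
        # Strict '>' keeps the first transcript seen when spans tie.
        if gene not in best or span > best[gene][0]:
            best[gene] = (span, row)
    # Genes come back in the order they were first encountered.
    return [row for _, row in best.values()]
```

To use it, pass an open text file handle (or any iterable of lines) and write the returned rows back out with `csv.writer(..., delimiter="\t")`. Because the kept transcript may start at a different position than the first one seen, the result may need re-sorting if the tool expects coordinate order.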

  • Genevieve Brandt (she/her)

    Which file is this appearing in?

  • Seunghun Han

    Both the interval-level and gene-level summary outputs have this. The screenshot above was from the gene-level summary output.

  • Genevieve Brandt (she/her)

    Seunghun Han, it's not clear to me whether this is an issue with the GATK output or with the method you are using to view the file. Could you share a screenshot of what this output looks like on the command line with the head command?

    I would also like to see how it differs from the GATK3 output.

    Thank you!
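One way to separate a viewer problem from a problem in the file itself is to look at the raw bytes rather than the rendered text. The sketch below is a hypothetical helper, not part of GATK. It reports every line that contains a non-ASCII byte, so the exact byte sequence can be inspected. If the file really contains the Unicode replacement character (U+FFFD) encoded as UTF-8, the bytes `\xef\xbf\xbd` will show up.

```python
def find_non_ascii(lines_of_bytes):
    """Yield (line_number, raw_bytes) for every line containing a non-ASCII byte."""
    for number, raw in enumerate(lines_of_bytes, start=1):
        # Any byte above 0x7F is outside the ASCII range.
        if any(byte > 0x7F for byte in raw):
            yield number, raw.rstrip(b"\r\n")
```

Open the summary file in binary mode (`open(path, "rb")`) and loop over `find_non_ascii(handle)`, printing each pair. Python prints bytes objects with escapes, so the output does not depend on how the terminal or editor renders the character.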

  • Seunghun Han

    I don't think it has anything to do with the way I'm viewing the file. This is how it looks when I open one of the gene-level summaries in Vim. I don't have outputs from GATK3 and GATK4 DoC runs on identical files, so I can't make a head-to-head comparison here, but I checked several outputs from GATK3 DoC runs and didn't find � in them.

    Also, a downstream tool I'm using takes these gene-level and interval-level summary outputs from DoC as inputs. The tool worked fine with GATK3 outputs, but the � symbols in the GATK4 outputs change the data type of some columns that are supposed to contain only numbers, and the tool is now failing.
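Until the cause is understood, a downstream parser could guard against the stray symbol by coercing the affected columns explicitly. The sketch below is a possible workaround, not a fix for the underlying output. It assumes the summary is comma-separated with a header row, and the function names are hypothetical. Anything that does not parse as a number, including �, is mapped to a placeholder (NaN by default).

```python
import csv
import math


def to_float(token, missing=math.nan):
    """Parse a numeric field, mapping anything unparseable to `missing`."""
    try:
        return float(token)
    except (TypeError, ValueError):
        # Covers stray symbols such as U+FFFD, empty strings, and absent fields.
        return missing


def read_summary(lines, numeric_columns, missing=math.nan):
    """Read a delimited summary and coerce the named columns to floats."""
    rows = []
    for record in csv.DictReader(lines):
        for column in numeric_columns:
            record[column] = to_float(record.get(column), missing)
        rows.append(record)
    return rows
```

For example, `read_summary(handle, ["average_coverage"])` keeps the column numeric throughout. Whether the placeholder should be NaN or 0 is a judgment call: the thread notes that most other genes with no coverage report 0, but substituting 0 hides the fact that the original value was not a number.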

  • Genevieve Brandt (she/her)

    Okay I see, thanks for sharing these updates! What version of GATK4 are you running?

  • Seunghun Han

    I'm using the broadinstitute/gatk:latest Docker image to run DoC.

  • Genevieve Brandt (she/her)

    Hi Seunghun Han,

    Could you upload your files that contain this issue in a zipped folder to our bug report FTP? There are instructions for how to do that here: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671

    Best,

    Genevieve
