CollectRnaSeqMetrics GL000220 test not counting ribosomal reads
AnsweredIf you are seeing an error, please provide(REQUIRED) :
a) GATK version used: gatk-4.1.9.0
b) Exact command used:
gatk-4.1.9.0/gatk CollectRnaSeqMetrics -I A56781.STAR.genome.sorted.bam.dup.bam.chrUn_GL000220v1.bam -O test --REF_FLAT GL000220v1.flat --RIBOSOMAL_INTERVALS GL220v1.intervallist --STRAND_SPECIFICITY NONE
c) Entire error log:
Using GATK jar /home/rcorbett/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/rcorbett/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar CollectRnaSeqMetrics -I A56781.STAR.genome.sorted.bam.dup.bam.chrUn_GL000220v1.bam -O test --REF_FLAT GL000220v1.flat --RIBOSOMAL_INTERVALS GL220v1.intervallist --STRAND_SPECIFICITY NONE
10:30:56.734 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rcorbett/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Dec 06 10:30:56 PST 2021] CollectRnaSeqMetrics --REF_FLAT GL000220v1.flat --RIBOSOMAL_INTERVALS GL220v1.intervallist --STRAND_SPECIFICITY NONE --INPUT A56781.STAR.genome.sorted.bam.dup.bam.chrUn_GL000220v1.bam --OUTPUT test --MINIMUM_LENGTH 500 --RRNA_FRAGMENT_PERCENTAGE 0.8 --METRIC_ACCUMULATION_LEVEL ALL_READS --ASSUME_SORTED true --STOP_AFTER 0 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Mon Dec 06 10:30:56 PST 2021] Executing as rcorbett@gphost11.bcgsc.ca on Linux 3.10.0-957.5.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_242-b08; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.9.0
INFO 2021-12-06 10:30:57 CollectRnaSeqMetrics Loaded 7 genes.
[Mon Dec 06 10:30:57 PST 2021] picard.analysis.CollectRnaSeqMetrics done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1999110144
Tool returned:
0
The command above reports large amounts of UTR alignments and zero Ribosomal alignments. In the bam file I am using the great majority of reads are aligning near the end of the 45SS annotation. If you like, I can share an example bam with you. Can you help me figure out why the ribosomal count is so small while the UTR count is so high?
Here is a transpose of the output:
RIBOSOMAL_BASES 0
CODING_BASES 0
UTR_BASES 60851
INTRONIC_BASES 1862
INTERGENIC_BASES 6084
IGNORED_READS 0
CORRECT_STRAND_READS 0
INCORRECT_STRAND_READS 0
NUM_R1_TRANSCRIPT_STRAND_READS 0
NUM_R2_TRANSCRIPT_STRAND_READS 2
NUM_UNEXPLAINED_READS 36
PCT_R1_TRANSCRIPT_STRAND_READS 0
PCT_R2_TRANSCRIPT_STRAND_READS 1
PCT_RIBOSOMAL_BASES 0
PCT_CODING_BASES 0
PCT_UTR_BASES 0.884501
PCT_INTRONIC_BASES 0.027065
PCT_INTERGENIC_BASES 0.088434
PCT_MRNA_BASES 0.884501
PCT_USABLE_BASES 0.843396
PCT_CORRECT_STRAND_READS 0
MEDIAN_CV_COVERAGE 2.765139
MEDIAN_5PRIME_BIAS 0.550029
MEDIAN_3PRIME_BIAS 0.212386
MEDIAN_5PRIME_TO_3PRIME_BIAS 0
I have a test bam file containing only reads aligned to chrUn_GL000220v1. My flat file contains only annotations in that region:
XR_950602.2 XR_950602.2 chrUn_GL000220v1 - 4561 8638 8638 8638 2 4561,6313, 4710,8638,
XR_950601.2 XR_950601.2 chrUn_GL000220v1 - 4561 8638 8638 8638 2 4561,5610, 4710,8638,
XR_950600.2 XR_950600.2 chrUn_GL000220v1 - 4561 8638 8638 8638 2 4561,5540, 4710,8638,
NR_038958.1 NR_038958.1 chrUn_GL000220v1 + 97128 126696 126696 126696 7 97128,100238,121604,121828,122990,124777,125783, 97436,100554,121701,122829,123082,124995,126696,
NR_046235.3 NR_046235.3 chrUn_GL000220v1 + 105423 118780 118780 118780 1 105423, 118780,
NR_003286.4 NR_003286.4 chrUn_GL000220v1 + 109077 110946 110946 110946 1 109077, 110946,
NR_003285.3 NR_003285.3 chrUn_GL000220v1 + 112023 112180 112180 112180 1 112023, 112180,
NR_003287.4 NR_003287.4 chrUn_GL000220v1 + 113347 118417 118417 118417 1 113347, 118417,
XR_002958845.1 XR_002958845.1 chrUn_GL000220v1 - 149692 156087 156087 156087 4 149692,155239,155699,156078, 153062,155499,155906,156087,
My ribosomal interval also only contains annotations from the same region:
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:chr1_KI270706v1_random LN:175055
@SQ SN:chr1_KI270707v1_random LN:32032
@SQ SN:chr1_KI270708v1_random LN:127682
@SQ SN:chr1_KI270709v1_random LN:66860
@SQ SN:chr1_KI270710v1_random LN:40176
@SQ SN:chr1_KI270711v1_random LN:42210
@SQ SN:chr1_KI270712v1_random LN:176043
@SQ SN:chr1_KI270713v1_random LN:40745
@SQ SN:chr1_KI270714v1_random LN:41717
@SQ SN:chr2_KI270715v1_random LN:161471
@SQ SN:chr2_KI270716v1_random LN:153799
@SQ SN:chr3_GL000221v1_random LN:155397
@SQ SN:chr4_GL000008v2_random LN:209709
@SQ SN:chr5_GL000208v1_random LN:92689
@SQ SN:chr9_KI270717v1_random LN:40062
@SQ SN:chr9_KI270718v1_random LN:38054
@SQ SN:chr9_KI270719v1_random LN:176845
@SQ SN:chr9_KI270720v1_random LN:39050
@SQ SN:chr11_KI270721v1_random LN:100316
@SQ SN:chr14_GL000009v2_random LN:201709
@SQ SN:chr14_GL000225v1_random LN:211173
@SQ SN:chr14_KI270722v1_random LN:194050
@SQ SN:chr14_GL000194v1_random LN:191469
@SQ SN:chr14_KI270723v1_random LN:38115
@SQ SN:chr14_KI270724v1_random LN:39555
@SQ SN:chr14_KI270725v1_random LN:172810
@SQ SN:chr14_KI270726v1_random LN:43739
@SQ SN:chr15_KI270727v1_random LN:448248
@SQ SN:chr16_KI270728v1_random LN:1872759
@SQ SN:chr17_GL000205v2_random LN:185591
@SQ SN:chr17_KI270729v1_random LN:280839
@SQ SN:chr17_KI270730v1_random LN:112551
@SQ SN:chr22_KI270731v1_random LN:150754
@SQ SN:chr22_KI270732v1_random LN:41543
@SQ SN:chr22_KI270733v1_random LN:179772
@SQ SN:chr22_KI270734v1_random LN:165050
@SQ SN:chr22_KI270735v1_random LN:42811
@SQ SN:chr22_KI270736v1_random LN:181920
@SQ SN:chr22_KI270737v1_random LN:103838
@SQ SN:chr22_KI270738v1_random LN:99375
@SQ SN:chr22_KI270739v1_random LN:73985
@SQ SN:chrY_KI270740v1_random LN:37240
@SQ SN:chrUn_KI270302v1 LN:2274
@SQ SN:chrUn_KI270304v1 LN:2165
@SQ SN:chrUn_KI270303v1 LN:1942
@SQ SN:chrUn_KI270305v1 LN:1472
@SQ SN:chrUn_KI270322v1 LN:21476
@SQ SN:chrUn_KI270320v1 LN:4416
@SQ SN:chrUn_KI270310v1 LN:1201
@SQ SN:chrUn_KI270316v1 LN:1444
@SQ SN:chrUn_KI270315v1 LN:2276
@SQ SN:chrUn_KI270312v1 LN:998
@SQ SN:chrUn_KI270311v1 LN:12399
@SQ SN:chrUn_KI270317v1 LN:37690
@SQ SN:chrUn_KI270412v1 LN:1179
@SQ SN:chrUn_KI270411v1 LN:2646
@SQ SN:chrUn_KI270414v1 LN:2489
@SQ SN:chrUn_KI270419v1 LN:1029
@SQ SN:chrUn_KI270418v1 LN:2145
@SQ SN:chrUn_KI270420v1 LN:2321
@SQ SN:chrUn_KI270424v1 LN:2140
@SQ SN:chrUn_KI270417v1 LN:2043
@SQ SN:chrUn_KI270422v1 LN:1445
@SQ SN:chrUn_KI270423v1 LN:981
@SQ SN:chrUn_KI270425v1 LN:1884
@SQ SN:chrUn_KI270429v1 LN:1361
@SQ SN:chrUn_KI270442v1 LN:392061
@SQ SN:chrUn_KI270466v1 LN:1233
@SQ SN:chrUn_KI270465v1 LN:1774
@SQ SN:chrUn_KI270467v1 LN:3920
@SQ SN:chrUn_KI270435v1 LN:92983
@SQ SN:chrUn_KI270438v1 LN:112505
@SQ SN:chrUn_KI270468v1 LN:4055
@SQ SN:chrUn_KI270510v1 LN:2415
@SQ SN:chrUn_KI270509v1 LN:2318
@SQ SN:chrUn_KI270518v1 LN:2186
@SQ SN:chrUn_KI270508v1 LN:1951
@SQ SN:chrUn_KI270516v1 LN:1300
@SQ SN:chrUn_KI270512v1 LN:22689
@SQ SN:chrUn_KI270519v1 LN:138126
@SQ SN:chrUn_KI270522v1 LN:5674
@SQ SN:chrUn_KI270511v1 LN:8127
@SQ SN:chrUn_KI270515v1 LN:6361
@SQ SN:chrUn_KI270507v1 LN:5353
@SQ SN:chrUn_KI270517v1 LN:3253
@SQ SN:chrUn_KI270529v1 LN:1899
@SQ SN:chrUn_KI270528v1 LN:2983
@SQ SN:chrUn_KI270530v1 LN:2168
@SQ SN:chrUn_KI270539v1 LN:993
@SQ SN:chrUn_KI270538v1 LN:91309
@SQ SN:chrUn_KI270544v1 LN:1202
@SQ SN:chrUn_KI270548v1 LN:1599
@SQ SN:chrUn_KI270583v1 LN:1400
@SQ SN:chrUn_KI270587v1 LN:2969
@SQ SN:chrUn_KI270580v1 LN:1553
@SQ SN:chrUn_KI270581v1 LN:7046
@SQ SN:chrUn_KI270579v1 LN:31033
@SQ SN:chrUn_KI270589v1 LN:44474
@SQ SN:chrUn_KI270590v1 LN:4685
@SQ SN:chrUn_KI270584v1 LN:4513
@SQ SN:chrUn_KI270582v1 LN:6504
@SQ SN:chrUn_KI270588v1 LN:6158
@SQ SN:chrUn_KI270593v1 LN:3041
@SQ SN:chrUn_KI270591v1 LN:5796
@SQ SN:chrUn_KI270330v1 LN:1652
@SQ SN:chrUn_KI270329v1 LN:1040
@SQ SN:chrUn_KI270334v1 LN:1368
@SQ SN:chrUn_KI270333v1 LN:2699
@SQ SN:chrUn_KI270335v1 LN:1048
@SQ SN:chrUn_KI270338v1 LN:1428
@SQ SN:chrUn_KI270340v1 LN:1428
@SQ SN:chrUn_KI270336v1 LN:1026
@SQ SN:chrUn_KI270337v1 LN:1121
@SQ SN:chrUn_KI270363v1 LN:1803
@SQ SN:chrUn_KI270364v1 LN:2855
@SQ SN:chrUn_KI270362v1 LN:3530
@SQ SN:chrUn_KI270366v1 LN:8320
@SQ SN:chrUn_KI270378v1 LN:1048
@SQ SN:chrUn_KI270379v1 LN:1045
@SQ SN:chrUn_KI270389v1 LN:1298
@SQ SN:chrUn_KI270390v1 LN:2387
@SQ SN:chrUn_KI270387v1 LN:1537
@SQ SN:chrUn_KI270395v1 LN:1143
@SQ SN:chrUn_KI270396v1 LN:1880
@SQ SN:chrUn_KI270388v1 LN:1216
@SQ SN:chrUn_KI270394v1 LN:970
@SQ SN:chrUn_KI270386v1 LN:1788
@SQ SN:chrUn_KI270391v1 LN:1484
@SQ SN:chrUn_KI270383v1 LN:1750
@SQ SN:chrUn_KI270393v1 LN:1308
@SQ SN:chrUn_KI270384v1 LN:1658
@SQ SN:chrUn_KI270392v1 LN:971
@SQ SN:chrUn_KI270381v1 LN:1930
@SQ SN:chrUn_KI270385v1 LN:990
@SQ SN:chrUn_KI270382v1 LN:4215
@SQ SN:chrUn_KI270376v1 LN:1136
@SQ SN:chrUn_KI270374v1 LN:2656
@SQ SN:chrUn_KI270372v1 LN:1650
@SQ SN:chrUn_KI270373v1 LN:1451
@SQ SN:chrUn_KI270375v1 LN:2378
@SQ SN:chrUn_KI270371v1 LN:2805
@SQ SN:chrUn_KI270448v1 LN:7992
@SQ SN:chrUn_KI270521v1 LN:7642
@SQ SN:chrUn_GL000195v1 LN:182896
@SQ SN:chrUn_GL000219v1 LN:179198
@SQ SN:chrUn_GL000220v1 LN:161802
@SQ SN:chrUn_GL000224v1 LN:179693
@SQ SN:chrUn_KI270741v1 LN:157432
@SQ SN:chrUn_GL000226v1 LN:15008
@SQ SN:chrUn_GL000213v1 LN:164239
@SQ SN:chrUn_KI270743v1 LN:210658
@SQ SN:chrUn_KI270744v1 LN:168472
@SQ SN:chrUn_KI270745v1 LN:41891
@SQ SN:chrUn_KI270746v1 LN:66486
@SQ SN:chrUn_KI270747v1 LN:198735
@SQ SN:chrUn_KI270748v1 LN:93321
@SQ SN:chrUn_KI270749v1 LN:158759
@SQ SN:chrUn_KI270750v1 LN:148850
@SQ SN:chrUn_KI270751v1 LN:150742
@SQ SN:chrUn_KI270752v1 LN:27745
@SQ SN:chrUn_KI270753v1 LN:62944
@SQ SN:chrUn_KI270754v1 LN:40191
@SQ SN:chrUn_KI270755v1 LN:36723
@SQ SN:chrUn_KI270756v1 LN:79590
@SQ SN:chrUn_KI270757v1 LN:71251
@SQ SN:chrUn_GL000214v1 LN:137718
@SQ SN:chrUn_KI270742v1 LN:186739
@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrEBV LN:171823
@PG ID:STAR PN:STAR CL:STAR --runThreadN 8 --genomeDir /projects/alignment_references/9606/hg38_no_alt/transcriptome/star-2.5.2b/hg38_no_alt_ensembl100/rsem_1.3.0 --genomeLoad NoSharedMemory --readFilesIn /projects/seqdev/UGNE-2119/Data_IX10127/F80619_merged_R1_3M.fastq.gz /projects/seqdev/UGNE-2119/Data_IX10127/F80619_merged_R2_3M.fastq.gz --readFilesCommand zcat --outFileNamePrefix F80619.temp/F80619 --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --quantMode TranscriptomeSAM VN:STAR_2.5.2b
@PG ID:sambamba markdup CL:/gsc/software/linux-x86_64-centos7/sambamba-0.6.1/sambamba_v0.6.1 markdup -t24 F80619.STAR.genome.bam F80619.STAR.genome.markDup.bam PP:STAR VN:0.6.1
@CO user command line: STAR --genomeDir /projects/alignment_references/9606/hg38_no_alt/transcriptome/star-2.5.2b/hg38_no_alt_ensembl100/rsem_1.3.0 --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix F80619.temp/F80619 --readFilesCommand zcat --readFilesIn /projects/seqdev/UGNE-2119/Data_IX10127/F80619_merged_R1_3M.fastq.gz /projects/seqdev/UGNE-2119/Data_IX10127/F80619_merged_R2_3M.fastq.gz
chrUn_GL000220v1 105424 118780 + gene_id "RNA45SN5"; transcript_id "NR_046235.3"; db_xref "GeneID:100861532"; gene "RNA45SN5"; product "RNA, 45S pre-ribosomal N5"; transcript_biotype "rRNA"; exon_number "1";
chrUn_GL000220v1 105424 118780 + gene_id "RNA45SN5"; transcript_id "NR_046235.3"; db_xref "GeneID:100861532"; gbkey "rRNA"; gene "RNA45SN5"; product "RNA, 45S pre-ribosomal N5"; transcript_biotype "rRNA";
chrUn_GL000220v1 109078 110946 + gene_id "RNA18SN5"; transcript_id "NR_003286.4"; db_xref "GeneID:100008588"; gene "RNA18SN5"; product "RNA, 18S ribosomal RNA N5"; transcript_biotype "rRNA"; exon_number "1";
chrUn_GL000220v1 109078 110946 + gene_id "RNA18SN5"; transcript_id "NR_003286.4"; db_xref "GeneID:100008588"; gbkey "rRNA"; gene "RNA18SN5"; product "RNA, 18S ribosomal RNA N5"; transcript_biotype "rRNA";
chrUn_GL000220v1 113348 118417 + gene_id "RNA28SN5"; transcript_id "NR_003287.4"; db_xref "GeneID:100008589"; gene "RNA28SN5"; product "RNA, 28S ribosomal RNA N5"; transcript_biotype "rRNA"; exon_number "1";
chrUn_GL000220v1 113348 118417 + gene_id "RNA28SN5"; transcript_id "NR_003287.4"; db_xref "GeneID:100008589"; gbkey "rRNA"; gene "RNA28SN5"; product "RNA, 28S ribosomal RNA N5"; transcript_biotype "rRNA";
-
Hi rcorbett,
Could you clarify if you think these results from CollectRnaSeqMetrics is incorrect for the input bams or if you are asking why your data is aligning to UTRs instead of ribosomal intervals?
Thank you,
Genevieve
-
Many thanks for your help Genevieve Brandt (she/her).
It is more that I am confused by the output. I am not expecting a high UTR amount as my data are almost exclusively ribosomal and 45S, to my knowledge contains only an ETS, not a UTR.
Also, my flatfile doesn't include a UTR annotation (unless I am mistaken).
So I guess I might be suggesting the results are incorrect, but I'm equally open to the issue being user error :)
-
One of the issues may be with your ref_flat file. The coding region start and end are the same value at the end of the transcript. Here is a link to more information about the ref_flat file format, the coding region start and end are the 3rd and 4th number in each annotation.
The way your ref_flat file is now is saying that the coding regions are 0 bases long. Since UTR regions are everything outside of the coding regions, this is probably why your UTR counts are so high.
The ribosomal count being 0 is more puzzling. Could you confirm that there are reads that overlap with the ribosomal regions? The reads would only be counted if their mates are not unmapped and their chromosome is the same as their mate chromosome. Double check for those possible issues and let me know what you find!
-
OK. I think we're making some progress here. I am starting with a refseq .gtf file where the annotation for my rRNA looks like this:
chrUn_GL000220v1 BestRefSeq transcript 105424 118780 . + . gene_id "RNA45SN5"; transcript_id "NR_046235.3"; db_xref "GeneID:100861532"; gbkey "rRNA"; gene "RNA45SN5"; product "RNA, 45S pre-ribosomal N5"; transcript_biotype "rRNA";
chrUn_GL000220v1 BestRefSeq exon 105424 118780 . + . gene_id "RNA45SN5"; transcript_id "NR_046235.3"; db_xref "GeneID:100861532"; gene "RNA45SN5"; product "RNA, 45S pre-ribosomal N5"; transcript_biotype "rRNA"; exon_number "1";Then I use gtfToGenePred to convert to a flat file with this command:
./gtfToGenePred -ignoreGroupsWithoutExons all.gtf all.flat
and then my flat file contains
NR_046235.3 NR_046235.3 chrUn_GL000220v1 + 105423 118780 118780 118780 1 105423, 118780,
As you pointed out above, this entry contains no coding region. Maybe there is a better way to go from refseq gtf to refflat?
My sample contains reads almost exclusively at the terminal end of the the 45S annotation. I am attaching and IGV screenshot of the region.
-
rcorbett we don't have any specific recommendations for how to create the ref_flat file. Since we are not the creators of gtfToGenePred, we are not able to look more closely about what might be going wrong. I would recommend reaching out to the developers of that tool for more information.
I did find a potential alternate tool but I have not tested it myself: https://biopet.github.io/gtftorefflat/develop/
I'm sorry we don't have more information. Please let me know if you have further questions.
Please sign in to leave a comment.
5 comments