Broad hg38 ICE interval list
Answered
Can you please provide
a) GATK version used : 4.1.4.1
b) Exact GATK commands used: GATK Somatic CNV workflows, GATK Somatic SNV workflows
I am updating our lab's somatic characterization workflows to run on hg38-aligned data. I need an interval list (or BED file), defining the hg38 target intervals for the ICE exome capture kit used here at the Broad. The page describing the Broad Resource Bundle currently states:
In addition, we are currently transitioning to support the Grch38/hg38 reference build, but we have not yet generated all of the files necessary for all use cases (in particular we are still missing the Hg38 version of the Broad's exome intervals).
How do I obtain or generate an hg38 version of this exome interval list?
Thank you,
Chet Birger
Getz Lab
-
Files are here now: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0/HybSelOligos
Have a good weekend!
-
Is there a standard place to look for this sort of thing? I'm always confused trying to find interval lists.
-
I've asked pipeline ops to move this interval list to the gcp-public-data--broad-references bucket, which is one of the standard places to find interval lists. The team is also working on documentation about the buckets that the best-practice pipelines in the gatk-workflows repos point to, so that users get more metadata.
-
Hi Tiffany Miller, I have downloaded the interval file from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0/HybSelOligos. Does the fifth field of the file "whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list" contain hg19 coordinates? And what is the difference between targets.interval_list and baits.interval_list?
-
Nickier, targets are the regions the exome capture is aiming to cover. Baits are the positions of the actual bait sequences that are used in the exome capture process. So targets are the regions of interest that the assay is designed to capture, and baits are where the molecules used during DNA capture align. They roughly correspond, but targets are generally larger than baits, and difficult regions or long targets may require multiple baits.
In general you probably want to use the targets file, since that is the region of interest. If you're trying to analyze capture efficiency or something about the sequencing process itself, you'd probably want to look at the baits as well.
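To see how a targets file and a baits file differ for a given kit, one option is to compare the total territory each covers. The following is a minimal sketch, assuming the usual interval_list layout (header lines starting with "@", then tab-separated contig, start, end, strand, name with 1-based inclusive coordinates). The function names are illustrative and not part of GATK or Picard.

```python
import sys


def read_intervals(handle):
    """Yield (contig, start, end) from an interval_list, skipping '@' header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue
        contig, start, end = line.split("\t")[:3]
        yield contig, int(start), int(end)


def territory(intervals):
    """Total bases covered after merging overlaps (1-based, inclusive coordinates)."""
    total = 0
    cur_contig, cur_start, cur_end = None, None, None
    for contig, start, end in sorted(intervals):
        if contig == cur_contig and start <= cur_end:
            # Overlaps the interval being built: extend it instead of double counting.
            cur_end = max(cur_end, end)
        else:
            if cur_contig is not None:
                total += cur_end - cur_start + 1
            cur_contig, cur_start, cur_end = contig, start, end
    if cur_contig is not None:
        total += cur_end - cur_start + 1
    return total


if __name__ == "__main__":
    # Usage (paths are placeholders): python territory.py targets.interval_list baits.interval_list
    for path in sys.argv[1:]:
        with open(path) as fh:
            print(path, territory(read_intervals(fh)))
```

Running it on both files of a kit prints one base count per file, which gives a rough sense of how much the two differ.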
-
Thank you very much, I still have a question. Can I use this bed file to replace the interval file? The bed file is downloaded from the CCDS database and converted to bed format.
## bed
wget ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt
cat CCDS.current.txt | grep "Public" | \
  perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' | \
  sort -u | bedtools sort -i | awk '{if($3>$2) print "chr"$0}' > hg38.exon.bed
-
Hi Chet! I am confirming with the production team, but I believe this is the one: gs://gcp-public-data--broad-references/hg38/v0/exome_calling_regions.v1.interval_list
-
This is not the correct file (apparently this is the joint calling interval list). We are finding the public one to point you to.
-
Update: The file is scheduled to be released in the gcp-public-data--broad-references bucket on Friday.
-
Hi Tiffany Miller I am looking for the hg19 version of this file, both the targets and the baits. I poked around the buckets above but didn't see anything that immediately stood out. Any direction would be great. Thank you!!
-
Maybe we should have a readme file for all the public data folders that describe what the files are?
-
Agreed, Louis Bergelson. There is a ReadMe, but it looks massively out of date.
dannykwells I've asked our team if we can get this file moved over. Then we have to coordinate with GCP to get it over since they are sponsoring the bucket. I'll let you know when that is done. May take a week or so.
-
Nickier, the hg38 targets interval list you pointed to is the right one for defining the hg38 target intervals for the ICE exome capture kit used at the Broad. Does the BED file you are pointing to make sense for what you are doing?
-
Tiffany Miller Thanks~~ Actually I am not sure whether what I described above is correct; I just saw it in some tutorials. Maybe I should use the *_Regions.bed file provided by Agilent, because my exome capture kit is SureSelect Human All Exon V7, and I have also downloaded this file from the Agilent website. By the way, do I need to convert the BED file to an interval file? On the left is the regions BED file I downloaded from the Agilent website; on the right is the target interval list provided by the GATK team.
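On the BED-versus-interval-list question, the main thing to watch is the coordinate convention: BED is 0-based and half-open, while interval_list is 1-based and closed, so only the start shifts by one. Below is a minimal sketch of that conversion for the body lines only. The function name and the fallback interval names are hypothetical, and a real interval_list also needs a SAM-style header built from the reference sequence dictionary, which Picard's BedToIntervalList produces when given that dictionary.

```python
def bed_to_interval_lines(bed_lines):
    """Convert BED records (0-based, half-open) to interval_list body lines (1-based, closed).

    Only the body is produced; the '@HD'/'@SQ' header that an interval_list
    requires is NOT generated here (assumption: it is added separately).
    """
    out = []
    for n, line in enumerate(bed_lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header/track lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        contig, start, end = fields[0], int(fields[1]), int(fields[2])
        # Hypothetical fallback name when the BED record has no name column.
        name = fields[3] if len(fields) > 3 else f"interval_{n}"
        strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "+"
        # BED start is 0-based, so add 1; BED end is exclusive, which equals the 1-based inclusive end.
        out.append(f"{contig}\t{start + 1}\t{end}\t{strand}\t{name}")
    return out


if __name__ == "__main__":
    for row in bed_to_interval_lines(["chr1\t99\t200\texon1\t0\t-\n"]):
        print(row)
```

For real use, the Picard tool is the safer route, since it also checks contigs against the sequence dictionary.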
-
I already got the answer at this tutorial, thank you again~~
-
You want to use the target files for the capture kit your data was generated with. What we provided here to answer the original post was for the ICE exome capture kit by Illumina.
dannykwells I am still waiting on these files to get transferred to GCP. Sorry for the wait.
-
dannykwells FYI, the hg19 version of this file, both the targets and the baits are now available in the gcp bucket: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0/HybSelOligos/whole_exome_illumina_coding_v1/?pli=1