Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Intervals and interval lists

22 comments

  • WVNicholson

    I can't find anything convincing on how to create a valid Picard interval file, although the information above suggests a recipe: create the header with "samtools view -H" and then add the required intervals by hand or otherwise. That may be a dirty hack that could cause problems in the long run, though. One of the online discussion forums has a thread about this issue, and it points to a Broad Institute GATK page that no longer exists ("Preparing the essential GATK input files").

     

    William

    2
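
The recipe described in this comment (take a SAM-style header, then append interval rows) can be sketched in Python. This is only a sketch under stated assumptions, not an official procedure: it assumes the Picard-style layout of a SAM-style header followed by five tab-separated columns (contig, 1-based start, inclusive end, strand, interval name). The function name, the sample header and the target values are hypothetical.

```python
def build_interval_list(header_lines, intervals):
    """Return interval-list text: SAM-style header, then one row per interval.

    header_lines: lines starting with '@', e.g. the output of
                  `samtools view -H` or the contents of a .dict file.
    intervals:    iterable of (contig, start, end, strand, name) tuples
                  with 1-based, inclusive coordinates (assumed layout).
    """
    # Collect contig names declared on @SQ lines (the SN: field).
    known = set()
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@SQ":
            for field in fields[1:]:
                if field.startswith("SN:"):
                    known.add(field[3:])

    rows = []
    for contig, start, end, strand, name in intervals:
        # Reject rows that the header cannot account for.
        if contig not in known:
            raise ValueError(f"contig {contig!r} is not declared in the header")
        if not 1 <= start <= end:
            raise ValueError(f"bad coordinates for {name!r}: {start}-{end}")
        if strand not in ("+", "-"):
            raise ValueError(f"bad strand for {name!r}: {strand!r}")
        rows.append(f"{contig}\t{start}\t{end}\t{strand}\t{name}")

    header = [line.rstrip("\n") for line in header_lines]
    return "\n".join(header + rows) + "\n"


# Hypothetical header and targets, for illustration only.
example_header = [
    "@HD\tVN:1.6",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2\tLN:242193529",
]
example_targets = [
    ("chr1", 1000, 2000, "+", "target_1"),
    ("chr2", 500, 900, "+", "target_2"),
]
interval_list_text = build_interval_list(example_header, example_targets)
```

Checking each contig against the header while writing the file is a deliberate choice here: it catches a mismatch between the intervals and the reference before any tool reads the file.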
  • registered_user

    Took me a while to figure this out, but the GATK list format is actually:

    <chr>:<start>-<stop>
    6
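
To make the `<chr>:<start>-<stop>` format concrete, here is a minimal Python sketch that parses such a string. It is an illustration, not GATK's own parser: the function name is hypothetical, only the bare-contig form and the `contig:start-stop` form are handled, and coordinates are assumed to be 1-based and inclusive.

```python
import re

# contig name, then ':', start, '-', stop. The greedy contig group lets
# contig names that themselves contain ':' still parse.
_INTERVAL = re.compile(r"^(?P<contig>.+):(?P<start>\d+)-(?P<stop>\d+)$")


def parse_interval(text):
    """Parse 'chr1:100-200' into ('chr1', 100, 200).

    A bare contig name such as 'chr1' is returned as ('chr1', None, None),
    meaning the whole contig. Whitespace inside the string is rejected,
    so 'chr1 100 200' raises ValueError.
    """
    text = text.strip()
    if not text or any(ch.isspace() for ch in text):
        raise ValueError(f"not a <chr>:<start>-<stop> interval: {text!r}")

    match = _INTERVAL.match(text)
    if match is None:
        # No coordinates given: treat the string as a contig name.
        return (text, None, None)

    start = int(match.group("start"))
    stop = int(match.group("stop"))
    if start < 1 or stop < start:
        raise ValueError(f"bad coordinates in {text!r}")
    return (match.group("contig"), start, stop)
```

For example, `parse_interval("chr1:100-200")` gives `("chr1", 100, 200)`, while the space-separated form `"chr1 100 200"` is rejected.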
  • Enrico Cocchi

    How do we download these blacklists that you state you made available?

    7
  • Patrícia H. Brito

    Hi,

    How can I access these WGS interval lists?

    "We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist!"

    2
  • Aldhair Médico

    Dear GATK developers,
    I'm trying to run Mutect2 for WES cancer data. 
    However, since the Resource bundle only supports hg19, it seems I cannot proceed.

    I've been looking for some hg38 interval_list file and I found: ''hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list''

    However, when I run GenomicsDBImport I get this error (whether I use my own hg38 reference and .dict or the ones from your Resource Bundle):
    ''A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary''

    So, my questions are: 
    1. Is there a release date for an hg38-based exome interval file? Will it be soon?
    2. Or is the file I used OK, and the error is coming from somewhere else?

    0
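
The USER ERROR quoted above says that a contig named in the intervals is absent from the sequence dictionary. One possible cause, offered here as an assumption rather than a diagnosis of this case, is a naming mismatch such as "chr1" in the interval list versus "1" in the dictionary. The Python sketch below compares the two sets of names; the function names and the sample lines are hypothetical.

```python
def contigs_in_dict(dict_lines):
    """Return contig names from the SN: field of @SQ header lines."""
    names = []
    for line in dict_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@SQ":
            names.extend(f[3:] for f in fields[1:] if f.startswith("SN:"))
    return names


def contigs_in_intervals(interval_lines):
    """Return the distinct contig names used by interval rows, in order.

    Handles tab-separated rows (contig in the first column) and
    contig:start-stop rows. Simplification: a bare contig name that
    itself contains ':' is not handled.
    """
    seen = []
    for line in interval_lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        if "\t" in line:
            contig = line.split("\t")[0]
        elif ":" in line:
            contig = line.rsplit(":", 1)[0]
        else:
            contig = line
        if contig not in seen:
            seen.append(contig)
    return seen


def missing_contigs(dict_lines, interval_lines):
    """Return interval contigs that the sequence dictionary does not declare."""
    known = set(contigs_in_dict(dict_lines))
    return [c for c in contigs_in_intervals(interval_lines) if c not in known]


# Hypothetical inputs: the dictionary uses "1"/"2", one interval uses "chr1".
example_dict = ["@HD\tVN:1.6", "@SQ\tSN:1\tLN:1000000", "@SQ\tSN:2\tLN:900000"]
example_intervals = [
    "@HD\tVN:1.6",
    "chr1\t1000\t2000\t+\ttarget_1",
    "2\t500\t900\t+\ttarget_2",
]
mismatched = missing_contigs(example_dict, example_intervals)
```

With real files, the lines would come from opening the `.dict` and `.interval_list` files; an empty result means every interval contig is declared in the dictionary.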
  • pollyshawn

    How to set Chr01 and Chr02?

    0
  • Hee-Bum Yang

    This article was very helpful when I ran GenomicsDBImport.

    However, as registered_user said, the format that actually works with GenomicsDBImport is "<chr>:<start>-<stop>", not "<chr> <start> <stop>".

    Why don't you update the article to reflect this?

    It is not easy for a beginner to find the solution.

     

    0
  • Carolina Paez

    This article is pretty informative.

    If I want to do an interval of chromosomes, should I use:

     -L <chr1>-<chr5>

    Any guidance will be appreciated.

     

    3
  • Neev Liberman

    How do I include sex chromosomes? Also, can you do an interval as stated above like:

    -L <chr1>-<chr23>
    1
  • Neev Liberman and Carolina Paez: I believe that you cannot use this syntax. You can either use multiple -L arguments:

    -L chr1 -L chr2 -L chr3 -L chr4 -L chr5

    or use an interval list/BED file listing the chromosomes you are after.

    0
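
Both options from this reply can be generated rather than typed by hand. The Python sketch below builds the repeated `-L` arguments and the text of a one-contig-per-line list file. It assumes hg38-style names (`chr1` to `chr22`, `chrX`, `chrY`); the function and variable names are hypothetical, and nothing is written to disk.

```python
def contig_args(contigs):
    """Return ['-L', 'chr1', '-L', 'chr2', ...] for a command line."""
    args = []
    for contig in contigs:
        args.extend(["-L", contig])
    return args


def contig_list_text(contigs):
    """Return list-file text with one contig name per line."""
    return "".join(f"{contig}\n" for contig in contigs)


# chr1 through chr5, as asked in the thread.
autosomes_1_to_5 = [f"chr{n}" for n in range(1, 6)]

# All autosomes plus the sex chromosomes. Assumption: the reference names
# them chrX and chrY, as hg38 does; there is no contig called chr23.
autosomes_and_sex = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]

command_line_args = contig_args(autosomes_1_to_5)
list_file_text = contig_list_text(autosomes_and_sex)
```

As far as I know, saving `list_file_text` to a file with a `.list` or `.intervals` extension and passing that file to a single `-L` corresponds to the second option; that extension detail is not stated in this thread.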
  • Carolina Paez

    Great to know! Thank you, Dror Kessler (‫דרור קסלר‬‎) for your help.

     

     

    0
  • J. Legebeke

    So, what do I put down for the -L argument if I just want to look across the whole genome and not just a specific region?

    1
  • Felipe Batalini

    Great explanation, thank you! Since many of the library prep kits are well established and somewhat standard, does Broad have a repository of the most commonly used interval lists? I see that whole_exome_illumina_coding_v1 is used in some workflows (e.g. Exome-Analysis-Pipeline - featured workspace), but what if my sequencing was done using a v6 kit? Where can I find that information? Shouldn't there be a repository? Thank you so much!

    0
  • rq m

    I want to confirm the format of the BED file. The article says <chr>:<start>-<stop>, but as far as I know, a BED file looks like this:

    chr1    1049    1500    exon00002       .       -       USA     exon    0       ID=exon00002;Ontology_term="GO:0046703";Ontology_term="GO:0046704"
    chr1    1299    1300    exon00001       .       +       Canada  exon    .       ID=exon00001;score=1;zeroLengthInsertion=True
    chr1    2999    3902    exon00003       .       ?       Canada  exon    2       ID=exon00003;score=4;Name=foo
    chr1    4999    5500    exon00004       .       .       .       exon    .       ID=exon00004;Gap=M8 D3 M6 I1 M6
    chr1    6999    9000    exon00005       10      +       .       exon    1       ID=exon00005;Dbxref="NCBI_gi:10727410"
    0
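
The two layouts in this comment describe the same regions in different coordinate systems: BED rows use a 0-based start and an exclusive end, while the `<chr>:<start>-<stop>` style is 1-based and inclusive. The Python sketch below shows that arithmetic on the first three BED columns. The function name is hypothetical, and the conversion is shown only for illustration.

```python
def bed_row_to_interval(row):
    """Convert a BED row to a 1-based inclusive 'contig:start-stop' string.

    Only the first three columns (contig, start, end) are used; any
    further columns are ignored. BED start is 0-based and BED end is
    exclusive, so the 1-based inclusive start is start + 1 and the
    inclusive stop equals the BED end.
    """
    fields = row.rstrip("\n").split()
    if len(fields) < 3:
        raise ValueError(f"BED row needs at least 3 columns: {row!r}")

    contig = fields[0]
    bed_start = int(fields[1])
    bed_end = int(fields[2])
    if bed_start < 0 or bed_end <= bed_start:
        # Zero-length features cannot be written as an inclusive range.
        raise ValueError(f"unsupported BED coordinates: {row!r}")

    return f"{contig}:{bed_start + 1}-{bed_end}"


# First row from the comment above, truncated to its first four columns.
converted = bed_row_to_interval("chr1\t1049\t1500\texon00002")
```

Here `converted` is `"chr1:1050-1500"`: the start shifts by one and the end is unchanged.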
  • Stuart Aidan Quinn

    Derek Caetano-Anolles, thanks for this helpful article! I have the same question as Enrico Cocchi and Aldhair Médico: which list are you referring to for the WGS blacklist? (I am running Mutect2 on an hg38 cancer dataset.) In gatk-best-practices/somatic-hg38 I found:

    1. CNV_and_centromere_blacklist.hg38liftover.list
    2. CNV.hg38liftover.bypos.v1.CR1_event_added.mod.seg
    3. final_centromere_hg38.seg

    I believe #1 is the most comprehensive, based on the title and the non-header line counts. Please correct me if I'm wrong, and thanks again!

    0
  • Francesca Tettamanzi

    Dear all, 

    Could you please indicate which files, among the several folders of the resource bundle, are the WGS interval lists, and where they are located?

    Many thanks in advance!

    Kind regards,

    Francesca 

    0
  • Gökalp Çelik

    Hi Francesca Tettamanzi

    If you are looking for the hg38 calling regions for whole genomes, the link to that file is below:

    https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list 

    Regards. 

    1
  • Francesca Tettamanzi

    Dear Gökalp Çelik

    many thanks for the rapid reply!

    I would like to take this opportunity to ask a related question, as I am new to WGS data analysis. I read in a post on the GATK blog (https://gatk.broadinstitute.org/hc/en-us/articles/17295731870235-Masked-reference-genomes) that using a masked reference genome can improve the accuracy of WGS variant calling.

    Is one option recommended over the other: using the provided WGS interval list, or using the masked reference genome linked above?

    Many thanks again!

    Kind regards,

    Francesca 

    0
  • Gökalp Çelik

    Hi again.

    Using masked reference genomes, or reference genomes without alt contigs, is actually recommended, since the handling of alt contigs is not optimal given the current state of aligners. Those calling regions can be applied to masked genomes as well, since we only use the main contigs for variant calling.

    I hope this helps. 

    0
  • Francesca Tettamanzi

    Dear Gökalp Çelik

    I see that these are complementary options; that really helps a lot, thank you! Could you tell me where, among the resources, I can find a masked hg38 reference (or the version without ALT contigs that you use at the Broad) for WGS read alignment? I looked in

    https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/

    but could not find anything that seems related to it.

    Many thanks again!

    Kind regards

    Francesca 

    0
  • Gökalp Çelik

    You can check out the masked DRAGEN reference files in the bucket below:

    https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0/dragen_reference 

     

    0
  • Francesca Tettamanzi

    Many thanks Gökalp Çelik!

    0