Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

CollectAllelicCounts interval_list

Answered

15 comments

  • Bhanu Gandham

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

     

  • Elizabeth237

    Hi,

    I ran into the same issue: the gnomAD SNPs-only VCF file (hg38) cannot be downloaded. If this file is not available, could you suggest some SelectVariants parameters or another method so that I can generate the file myself?

    Thanks a lot.
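One possible way to build such a file is sketched below. It only prints a SelectVariants command rather than running it, so the arguments can be reviewed first. The three paths are placeholders supplied by the caller, and the choice to keep biallelic SNP sites only is an assumption based on the tutorial's "SNPs-only" wording, not an official recipe.

```shell
# Print (do not run) a SelectVariants command that keeps biallelic SNP sites.
# $1 = reference FASTA, $2 = input VCF, $3 = output VCF (all placeholders).
# Assumes the paths contain no spaces, since the command is printed as one line.
snps_only_cmd() {
    printf 'gatk SelectVariants -R %s -V %s --select-type-to-include SNP --restrict-alleles-to BIALLELIC -O %s\n' "$1" "$2" "$3"
}

# Usage (hypothetical file names):
#   snps_only_cmd Homo_sapiens_assembly38.fasta gnomad.vcf.gz gnomad.snps_only.vcf.gz
```

The printed line can be copied into a terminal once the paths look right.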

  • Genevieve Brandt (she/her)

    Hi Elizabeth237, what workflow are you using? I don't see the link you are talking about here: https://gatk.broadinstitute.org/hc/en-us/articles/360035535892-Somatic-copy-number-variant-discovery-CNVs-

  • sahuno

    Elizabeth237 

    1. gnomAD resources for hg38 can be found in the Funcotator Google bucket:

    https://storage.cloud.google.com/broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/

    There are two separate .vcf.gz files, one for exomes and one for genomes. I wish I could tell you more about the differences between the two resources.

    gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz

    gnomad.genomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz

    2. There is another gnomAD file (af-only-gnomad.hg38.vcf.gz), usually referred to in the tutorial documentation, stored in the GATK best-practices Google bucket:

    https://storage.cloud.google.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz?authuser=0

    As for which file is appropriate for a particular task, I will leave that to the wonderful GATK Team to comment on.

     

    Kind regards

    Sam

  • Elizabeth237

    Hi Genevieve Brandt, the workflow is here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890011#5.1. The link provided by this page is https://gatk.zendesk.com/hc/en-us/articles/360036212652, and I cannot find the page it points to.

    Thanks a lot!

  • Genevieve Brandt (she/her)

    Hi Elizabeth237, did you see sahuno's comment above? Those look like the resources necessary for the tutorial!

  • Elizabeth237

    OK, I have sorted it out. Thank you very much.

  • jejacobs23

    Thank you sahuno, Elizabeth237 and Genevieve for your comments on this issue. I'm still having the same problem accessing the file from the GATK Resource Bundle. As stated in my original post and in Elizabeth237's post, the link in footnote #9 is no longer valid: it returns a 404 error, "The page you were looking for doesn't exist". Furthermore, when I download the file from the link in sahuno's post with

    wget https://storage.cloud.google.com/broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/gnomad.genomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz

    I get a file that does not appear to be in .vcf format; I think it is HTML. I attempted to download the .vcf from gnomAD directly, but the file is much too large to work with. I would greatly appreciate any help you could offer. Is there a file that I can use for the "SNPs-only" .vcf that is required in the CollectAllelicCounts step of the CNV workflow? Better yet, is there a way I can create this file myself?

    Thanks in advance,

    James. 
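A quick way to confirm this kind of suspicion is to look at the first two bytes of the download. The sketch below relies on the fact that gzip data always begins with the bytes 0x1f 0x8b. The function name and the file name in the usage comment are illustrative only. A likely explanation for the HTML, stated here as an assumption, is that storage.cloud.google.com is the browser endpoint that expects a signed-in session, so wget saved a sign-in page instead of the object.

```shell
# Return success if the file starts with the gzip magic bytes 0x1f 0x8b.
is_gzip() {
    local magic
    # Read the first two bytes and render them as lowercase hex without spaces.
    magic=$(head -c 2 -- "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Usage (hypothetical file name):
#   is_gzip downloaded.vcf.gz && echo "gzip data" || echo "not gzip (possibly an HTML page)"
```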

  • sahuno

    Hi jejacobs23

    The `gsutil` tool (the Google Cloud command-line tool) lets you download Google Cloud objects.

    To install `gsutil`, see https://cloud.google.com/storage/docs/gsutil_install

     

    After installing, use this command to download the file to the current directory:

    `gsutil cp gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz .`

     

    Note that links to objects always start with `gs://`, in the form `gs://googleBucket/location/ofSomefile.ext`.

    Use `gsutil ls gs://googleBucket/` to list the files in a particular bucket.

     

    hope this helps

    Sam
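The browser links posted earlier in this thread can be turned into `gs://` URIs mechanically. The helper below is a minimal sketch of that rewrite. Its name is made up for illustration, and it assumes the link uses the storage.cloud.google.com or storage.googleapis.com host followed by the bucket and object path.

```shell
# Convert a Google Cloud Storage browser URL into a gs:// URI.
to_gs_uri() {
    local url=$1
    case $url in
        gs://*) printf '%s\n' "$url"; return 0 ;;   # already a gs:// URI
    esac
    url=${url%%\?*}                                 # drop a query string such as ?authuser=0
    url=${url#https://storage.cloud.google.com/}    # strip the browser host
    url=${url#https://storage.googleapis.com/}      # or the API host
    printf 'gs://%s\n' "$url"
}

# Usage:
#   gsutil cp "$(to_gs_uri 'https://storage.cloud.google.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz?authuser=0')" .
```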

     

  • jejacobs23

    Thanks sahuno.  That worked like a charm.

  • Yuanyuan Wu

    Dear GATK Team,

    I downloaded the file listed below, but it is different from the interval list in the example, chr17_theta_snps.interval_list, which is used as follows:

    gatk --java-options "-Xmx3g" CollectAllelicCounts \
        -L chr17_theta_snps.interval_list \
        -I normal.bam \
        -R /gatk/ref/Homo_sapiens_assembly38.fasta \
        -O sandbox/hcc1143_N_clean.allelicCounts.tsv

    gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz

    Is it fine to pass this variant file (gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf) directly to CollectAllelicCounts, or do I have to preprocess it according to some rule?

    Thanks

    Yuanyuan

  • Genevieve Brandt (she/her)

    Hi Yuanyuan,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, check out our support policy.

  • Genevieve Brandt (she/her)

    Hi Yuanyuan Wu

    Could you clarify your question? If you are following this tutorial, it provides example data. What is the difference you are seeing in the example data?

    Make sure to follow guidelines from this post.

    Best,

    Genevieve

  • Enrico Cocchi

    OK guys, I see that we can download both of these files from the GATK bundle:

    1. gnomAD GENOMES (29 GB)
    2. gnomAD EXOMES (2 GB)

    Both can be passed directly to CollectAllelicCounts --intervals.

    If I pass the genomes file, the pipeline hangs while loading it (I guess 29 GB is slow to process); with the exomes file it runs.

    For a somatic CNV pipeline running on WGS hg38 samples:

    1. Which one are we supposed to use, gnomAD GENOMES or EXOMES? (I guess genomes, but it takes too long to load.)
    2. Do we need to preprocess the file somehow? (For example, create a simple BED with only the positions from the VCF.GZ file and then pass it to --intervals?)

     

    Thank you a lot in advance for any help!
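For the preprocessing question, one lightweight option is to filter the VCF text directly. The sketch below keeps header lines and records whose REF and single ALT allele are each one base, which drops indels and multiallelic sites. It assumes uppercase A/C/G/T alleles in columns 4 and 5, and the function name is hypothetical. Dropping multiallelic SNPs is a simplifying choice here, not something this thread prescribes.

```shell
# Read a VCF on stdin and write only header lines and biallelic SNP records to stdout.
filter_snps() {
    awk -F '\t' '
        /^#/ { print; next }                         # keep all header lines
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ { print } # REF and ALT are single bases
    '
}

# Usage (hypothetical file names):
#   gzip -dc gnomad.vcf.gz | filter_snps > gnomad.snps_only.vcf
```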

     

  • Genevieve Brandt (she/her)

    Hi Enrico Cocchi,

    1. If you have WGS data, then you should use the genomes file.
    2. There is a note in the tutorial regarding the interval file: "The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should represent sites of common and/or sample-specific germline variant SNPs-only sites. Omit indel-type and mixed-variant-type sites." So, make sure the interval list is SNPs only. The interval file can be a VCF file; interval lists can be given in a few different formats.
    3. Regarding your concern about the size of the genomes file, it looks like it would be trivial to break up the genomes file by chromosome, run multiple instances of CollectAllelicCounts, and then combine the output TSV files.

    Thanks for providing all those links to this conversation to help out other users as well.

    Best,

    Genevieve
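The combining step in point 3 could look roughly like the sketch below. It assumes each per-chromosome output has '@'-prefixed header lines, then one column-header line whose first field is CONTIG, then data rows. That layout is an assumption worth checking against your own files. The function name and file names are illustrative.

```shell
# Merge per-chromosome allelic-count TSVs (given in the desired order) into one stream.
merge_allelic_counts() {
    # Header block and column names are taken from the first file only.
    awk -F '\t' '/^@/ || $1 == "CONTIG"' "$1"
    # Data rows are taken from every file, in the order given.
    for f in "$@"; do
        awk -F '\t' '!/^@/ && $1 != "CONTIG"' "$f"
    done
}

# Usage (hypothetical file names):
#   merge_allelic_counts chr1.allelicCounts.tsv chr2.allelicCounts.tsv > all.allelicCounts.tsv
```

Passing the files in reference order keeps the merged rows sorted the same way a single run would produce them.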
