CollectAllelicCounts interval_list
Good afternoon,
I am working with GATK 4.1.7.0 and am attempting to use the CollectAllelicCounts tool as part of a somatic CNV workflow. I need some additional help with the interval list that is to be used with the "-L" option. Footnote #9 of the tutorial "(How to part II) Sensitively detect copy ratio alterations and allelic segments" refers to the SNPs-only gnomAD VCF files available in the GATK Resource Bundle, but unfortunately that link is no longer valid. When I go to the Google Cloud Platform (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0) I don't see the gnomAD SNPs-only VCF file as an option to download. Can you please help me with this? Is this file available for download somewhere? Or am I supposed to create it myself as part of the workflow? I am working with WGS data from tumor and matched normal samples.
-
Hi,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
Hi,
I ran into the same issue. The gnomAD SNPs-only VCF file (hg38) cannot be downloaded. If this file is not available, could you provide some SelectVariants parameters or methods to help me generate the file myself?
Thanks a lot.
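
For anyone who wants to generate the file themselves, here is a minimal sketch of a SNPs-only subset made with SelectVariants. It assumes `gatk` (GATK4) is on the PATH and that the input VCF is bgzipped and indexed. The function name and the output file name are made up for illustration, and this is not necessarily how the file mentioned in the tutorial was produced.

```shell
#!/usr/bin/env bash
# Sketch: keep only SNP-type sites from a sites VCF so it can be passed to -L.
# Assumptions: `gatk` (GATK4) is on PATH; the input VCF is bgzipped and indexed.
# make_snps_only and the file names in the usage line are illustrative.

make_snps_only() {
    in_vcf="$1"    # e.g. a gnomAD sites VCF
    out_vcf="$2"   # SNPs-only output VCF
    gatk SelectVariants \
        -V "$in_vcf" \
        --select-type-to-include SNP \
        -O "$out_vcf"
}

# Usage (uncomment to run):
# make_snps_only \
#     gnomad.genomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz \
#     gnomad.genomes.snps_only.hg38.vcf.gz
```

Selecting by type should drop indel and mixed sites, which matches the tutorial's "SNPs-only" requirement, but it is worth checking a few records of the output before using it.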
-
Hi Elizabeth237, what workflow are you using? I don't see the link you are talking about here: https://gatk.broadinstitute.org/hc/en-us/articles/360035535892-Somatic-copy-number-variant-discovery-CNVs-
-
1. gnomAD resources for hg38 can be found in the Funcotator Google bucket:
https://storage.cloud.google.com/broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/
There are two separate .vcf.gz files, one for exomes and another for genomes, and I wish I could tell you more about the differences between the two resources.
gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz
gnomad.genomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz
2. There is another gnomAD file (af-only-gnomad.hg38.vcf.gz), usually referred to in the tutorial documentation, stored in the GATK best-practices Google bucket.
As for which one is appropriate for a particular task, I will leave that to the wonderful GATK Team to comment on.
Kind regards
Sam
-
Hi Genevieve Brandt, the workflow is here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890011#5.1. The link provided by that page, which I cannot open, is this one: https://gatk.zendesk.com/hc/en-us/articles/360036212652.
Thanks a lot!
-
Hi Elizabeth237, did you see sahuno's comment above? Those look like the resources necessary for the tutorial!
-
OK, I have handled it. Thank you very much.
-
Thank you sahuno, Elizabeth237 and Genevieve for your comments on this issue. I'm still having the same problem with accessing the file from the GATK Resource Bundle. As stated in my original post and in Elizabeth237's post, the link in footnote #9 is no longer valid. You get an error 404, "The page you were looking for doesn't exist". Furthermore, when I download the file from the link in sahuno's post
wget https://storage.cloud.google.com/broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/gnomad.genomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz
I get a file that does not appear to be in .vcf format. I think it's in HTML format. I attempted to download the .vcf from gnomAD directly, but the file is much too large to work with. I would greatly appreciate any help you could offer. Is there a file that I can use for the "SNPs-only" .vcf that's required in the CollectAllelicCounts step of the CNV workflow? Better yet, is there a way I can create this file myself?
Thanks in advance,
James.
-
Hi jejacobs23,
You need `gsutil` (the Google Cloud command-line tool) to be able to download Google Cloud objects.
To install `gsutil`, see https://cloud.google.com/storage/docs/gsutil_install
After installing, use this to download the file to the current directory:
`gsutil cp gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz .`
Note that links to objects always take the form `gs://googleBucket/location/ofSomefile.ext`.
Use `gsutil ls gs://googleBucket/` to list the files in a particular bucket.
Hope this helps,
Sam
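
Since the `wget` attempt above produced something that looked like HTML rather than a VCF, a quick sanity check after downloading may be useful. The sketch below only inspects the first two bytes of the file; `is_gzip` is a hypothetical helper name, and the check says nothing about whether the VCF content itself is valid.

```shell
#!/usr/bin/env bash
# Sketch: check that a downloaded file is really gzip-compressed
# and not, for example, an HTML page saved under a .vcf.gz name.
# is_gzip is an illustrative helper, not part of gsutil or GATK.

is_gzip() {
    # gzip (and bgzip) files start with the two magic bytes 0x1f 0x8b.
    magic="$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')"
    [ "$magic" = "1f8b" ]
}

# Usage (uncomment to run; requires gsutil and network access):
# f=gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz
# gsutil cp "gs://broad-public-datasets/funcotator/gnomAD_2.1_VCF_INFO_AF_Only/hg38/$f" .
# if is_gzip "$f"; then echo "looks like gzip"; else echo "not gzip, possibly an HTML page"; fi
```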
-
Thanks sahuno. That worked like a charm.
-
Dear GATK Team,
I downloaded data from the following link, but it is different from the example interval list, "chr17_theta_snps.interval_list", shown in this command:
gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv
gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf.gz
Will it be fine if I pass this variant file (gnomad.exomes.r2.1.sites.liftoverToHg38.INFO_ANNOTATIONS_FIXED.vcf) directly to CollectAllelicCounts,
or do I have to preprocess it according to some rule?
Thanks
Yuanyuan
-
Hi Yuanyuan Wu,
Could you clarify your question? If you are following this tutorial, it provides example data. What is the difference you are seeing in the example data?
Make sure to follow guidelines from this post.
Best,
Genevieve
-
Ok guys I see we are able to download both a
- gnomAD GENOMES (29 GB)
- gnomAD EXOMES (2 GB)
file from the GATK bundle.
And they can be passed directly to CollectAllelicCounts --intervals.
I see that if I pass the genomes file, the pipeline hangs while loading it (I guess 29 GB is slow to process), whereas with the exomes file it runs.
For a somatic CNV pipeline running on WGS hg38 samples:
- Which one are we supposed to use, gnomAD GENOMES or EXOMES? (I guess genomes, but it takes too long to load.)
- Do we need to preprocess the file somehow? (For example, create a simple BED with only the positions from the VCF.GZ file and then pass that to --intervals?)
Thank you a lot in advance for any help!
-
Hi Enrico Cocchi,
- If you have WGS data, then you should use the genomes file.
- There is a note in the tutorial regarding the interval file: "The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should represent sites of common and/or sample-specific germline variant SNPs-only sites. Omit indel-type and mixed-variant-type sites." So, make sure the interval list is SNPs only. It is fine for the interval file to be a VCF; there are a few different options for the format of interval lists.
- Regarding your concern about the size of the genomes file, it looks like it would be trivial to break up the genomes file by chromosome, run multiple instances of CollectAllelicCounts, and then combine the output TSV files.
Thanks for providing all those links to this conversation to help out other users as well.
Best,
Genevieve
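
The split-and-combine suggestion could look roughly like the sketch below. It is an illustration under assumptions rather than a tested pipeline: `gatk` is on the PATH, the SNPs-only VCF is bgzipped and indexed (needed for `-L` subsetting), and each allelic-counts TSV is assumed to start with `@`-prefixed header lines followed by a single column-header line beginning with `CONTIG`. The function names and file-name patterns are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: run CollectAllelicCounts one contig at a time, then merge the TSVs.
# Assumptions (not confirmed in the thread): `gatk` is on PATH; the SNPs-only
# VCF is bgzipped and indexed; each output TSV has '@' header lines followed
# by one column-header line that starts with "CONTIG".

collect_per_contig() {
    snps_vcf="$1"; bam="$2"; ref="$3"; out_dir="$4"
    shift 4
    for contig in "$@"; do
        # Subset the sites to one contig so each run loads fewer intervals.
        gatk SelectVariants \
            -V "$snps_vcf" \
            -L "$contig" \
            -O "$out_dir/snps.$contig.vcf.gz"
        gatk CollectAllelicCounts \
            -L "$out_dir/snps.$contig.vcf.gz" \
            -I "$bam" \
            -R "$ref" \
            -O "$out_dir/$contig.allelicCounts.tsv"
    done
}

combine_allelic_counts() {
    # Print the first file whole; from later files, skip '@' header lines
    # and the column-header line so only data rows are appended.
    awk 'FNR == NR || !/^(@|CONTIG\t)/' "$@"
}

# Usage (uncomment to run; file names are placeholders):
# collect_per_contig snps_only.vcf.gz normal.bam Homo_sapiens_assembly38.fasta out chr1 chr2
# combine_allelic_counts out/chr1.allelicCounts.tsv out/chr2.allelicCounts.tsv > normal.allelicCounts.tsv
```

Pass the per-contig files to `combine_allelic_counts` in the same order as the contigs appear in the reference sequence dictionary, so the merged rows stay sorted the way a single run would produce them.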