Somatic variant calling of WES using GRCh38.p14 reference and mutect_resources.wdl
REQUIRED for all errors and issues:
a) GATK version used: 4.6.0.0
b) Exact command used: mutect_resources.wdl
c) Entire program log:
Hello, I am following the Mutect2 pipeline for somatic variant calling and building the required intermediate files from newer data sets. I would like to share some observations, and any insight into whether my approach is sound is welcome.
1. First I used dragen-os to build the mapping reference from the p14 version of GRCh38.
Then I used dragen-os to map my tumor and normal WES data, and used samtools to sort, mark and remove duplicates, and create the index.
Observation:
A lot of the reads are mapped to chr1_KI270706v1_random (or the equivalent scaffolds of other chromosomes) instead of just chr1 (or the other chromosomes).
Question:
Is this normal? Is there anything I can do to reduce the number of reads mapping to unlocalized scaffolds?
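One way to put a number on this observation is to tally mapped reads per contig class from `samtools idxstats`. The following is only a minimal sketch and not part of the original post: the function name `summarize_idxstats`, the BAM file name in the usage comment, and the choice to treat chr1-22, chrX, chrY and chrM as primary are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: reads `samtools idxstats` output on stdin
# (tab-separated: contig, length, mapped reads, unmapped reads)
# and reports mapped reads on primary contigs vs. all other contigs.
summarize_idxstats() {
  awk -F'\t' '
    $1 == "*" { next }                      # skip the row for reads with no contig
    $1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$/ { primary += $3; next }
    { other += $3 }                         # _random, chrUn_, alt, fix and similar
    END {
      total = primary + other
      printf "primary\t%d\n", primary
      printf "non_primary\t%d\n", other
      if (total > 0) printf "non_primary_fraction\t%.4f\n", other / total
    }'
}

# Example invocation (needs samtools and an indexed BAM; file name is a placeholder):
#   samtools idxstats tumor.sorted.bam | summarize_idxstats
```

Comparing the `non_primary_fraction` of the tumor and normal samples, or of the same sample mapped against different references, gives a rough sense of how much mapping is affected.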
2. For the BQSR step, I used https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz
3. I used mutect_resources.wdl https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl
to create the af-only.vcf from gnomAD v4.1 https://gnomad.broadinstitute.org/downloads#v4
I also slightly modified the code in mutect_resources.wdl because some of it was not compatible with the newer GATK package. I changed the corresponding code to:
# Zip the VCF:
bgzip -c simplified.vcf > ${output_name}.vcf.gz
# Index output file:
gatk --java-options "-Xmx64g" IndexFeatureFile -I ${output_name}.vcf.gz
Observation:
[preemptible, disks, cpu, memory] is/are not supported by backend. Unsupported attributes will not be part of job executions.
Question:
I suppose the newer Cromwell package has made some big changes. Is it therefore still recommended to use mutect_resources.wdl?
Observation:
4. In mutect2.wdl, the common-biallelic-snps.vcf created by mutect_resources.wdl is used as Mutect2_Multi.variants_for_contamination. However, I didn't find any use of this file in the tutorial
https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2
Question: Is it necessary to use common-biallelic-snps.vcf with the newest (4.6.0.0) version of GATK?
Thanks and looking forward to replies!
-
Hi D S
The answer to your first question is to exclude those unlocalized contigs if you do not want reads to map anywhere but the primary contigs. However, those unlocalized contigs usually absorb some of the clutter that would otherwise land on the primary contigs, so if you have concerns you may need to analyze your data with and without these contigs and check whether additional FP and TN calls accumulate in either case. The DRAGEN workflow normally has its own recommendations for the masked reference sequence.
https://gatk.broadinstitute.org/hc/en-us/articles/17295731870235-Masked-reference-genomes
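A "with and without" comparison can be approximated after alignment by restricting reads to the primary contigs. The sketch below is an illustration under assumptions, not a recommendation from this thread: the function name `fai_to_primary_bed` and all file names are hypothetical, and "primary" is taken to mean chr1-22, chrX and chrY. It builds a BED file from a `samtools faidx` index (`.fai`: contig name, length, offset, line bases, line width).

```shell
#!/bin/sh
# Hypothetical helper: reads a FASTA index (.fai) on stdin and writes one
# whole-contig BED interval per primary contig (chr1-22, chrX, chrY).
fai_to_primary_bed() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    $1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/ { print $1, 0, $2 }'
}

# Example invocation (file names are placeholders):
#   fai_to_primary_bed < GRCh38.p14.fa.fai > primary_contigs.bed
#   samtools view -b -L primary_contigs.bed -o tumor.primary.bam tumor.sorted.bam
```

Note that filtering an existing BAM this way is not the same as removing the scaffolds from the reference before mapping: reads that were already placed on a scaffold are simply dropped rather than realigned to a primary contig.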
Common biallelic SNPs are used to calculate tumor segmentation and contamination, so we recommend using them.
Cromwell options depend on the version as well as the server configuration, so there may not be a simple answer. Most of our workflows are already built in and ready to go on Terra; however, custom WDLs may need further testing for compatibility.
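For readers running the tools by hand rather than through the WDL, the contamination step can be sketched roughly as follows. This is an assumed command-line outline, not the exact commands from mutect2.wdl: the function name `estimate_contamination`, the `GATK` override variable and the output file names are hypothetical.

```shell
#!/bin/sh
# Set GATK=echo to print the commands instead of running them (dry run).
GATK="${GATK:-gatk}"

# Hypothetical wrapper: pileups at common biallelic SNPs for tumor and normal,
# then contamination and tumor segmentation tables.
# Arguments: tumor BAM, normal BAM, common biallelic SNP VCF, output prefix.
estimate_contamination() {
  tumor_bam=$1; normal_bam=$2; common_snps=$3; prefix=$4
  "$GATK" GetPileupSummaries -I "$tumor_bam" -V "$common_snps" -L "$common_snps" \
    -O "${prefix}.tumor.pileups.table" &&
  "$GATK" GetPileupSummaries -I "$normal_bam" -V "$common_snps" -L "$common_snps" \
    -O "${prefix}.normal.pileups.table" &&
  "$GATK" CalculateContamination -I "${prefix}.tumor.pileups.table" \
    -matched "${prefix}.normal.pileups.table" \
    --tumor-segmentation "${prefix}.segments.table" \
    -O "${prefix}.contamination.table"
}

# Example invocation (file names are placeholders):
#   estimate_contamination tumor.bam normal.bam common-biallelic-snps.vcf.gz sample1
```

The two resulting tables would then typically be passed to FilterMutectCalls through `--contamination-table` and `--tumor-segmentation`.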
I hope this helps.
-
Thank you so much for your answer! I think that even though the [preemptible, disks, cpu, memory] attributes are not recognized, the system just goes with the default settings, which still runs, but slower. May I ask two follow-up questions?
I looked into the code of mutect_resources.wdl. gatk/scripts/mutect2_wdl/mutect_resources.wdl at master · broadinstitute/gatk · GitHub
If this was used to generate the files in the best practices, I suppose using it on gnomAD 4.1 would generate valid allele-frequency-only VCFs?
Secondly, in the SelectCommonBiallelicSNPs task, there is an option called minimum_allele_frequency. Is the value 0 recommended for this option?
I am just thinking that, since gnomAD 4.1 is so big, generating the AF-only VCF and common biallelic SNPs from it would be more helpful.
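To make the role of the threshold concrete, here is a rough sketch of selecting biallelic SNPs above an allele-frequency cutoff with SelectVariants. It is an approximation of what such a task might run, not a copy of the WDL: the function name `select_common_biallelic_snps`, the `GATK` override variable, the file names and the 0.05 value in the usage comment are assumptions for illustration only.

```shell
#!/bin/sh
# Set GATK=echo to print the command instead of running it (dry run).
GATK="${GATK:-gatk}"

# Hypothetical wrapper: keep biallelic SNPs whose INFO AF exceeds a threshold.
# Arguments: AF-only input VCF, minimum allele frequency, output VCF.
select_common_biallelic_snps() {
  af_only_vcf=$1; min_af=$2; out_vcf=$3
  "$GATK" SelectVariants -V "$af_only_vcf" \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -select "AF > ${min_af}" \
    -O "$out_vcf"
}

# Example invocation (0.05 is an illustrative value, not a recommendation):
#   select_common_biallelic_snps af-only-gnomad.vcf.gz 0.05 common-biallelic-snps.vcf.gz
```

With a filter of this form, a threshold of 0 keeps every biallelic SNP that has a nonzero AF, so the output is no longer limited to common sites.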
Best,
-
Hi again.
We do have our own resource files for such purposes, and you may use them instead of creating your own. An AF-only gnomAD resource is available among them.
https://storage.googleapis.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
All other resource bundle files can be found in the following link
https://console.cloud.google.com/storage/browser/gatk-best-practices
For compatibility purposes you may need to adjust the header sections of these resource files and remove any non-applicable contigs from the variant contexts. We usually work with variants on the reference contigs, so anything outside of chr1-22, X, Y is usually not useful unless you have a specific purpose.
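One possible way to do both adjustments with GATK itself is sketched below. It is offered as an assumption-laden outline, not a tested recipe: the function name `harmonize_resource_vcf`, the `GATK` override variable, the file names and the order of the two steps are hypothetical and may need adapting to the resource in question.

```shell
#!/bin/sh
# Set GATK=echo to print the commands instead of running them (dry run).
GATK="${GATK:-gatk}"

# Hypothetical wrapper: keep only variants on the wanted contigs, then replace
# the VCF contig header lines with those from the reference dictionary.
# Arguments: resource VCF, reference .dict, intervals/BED of wanted contigs, output prefix.
harmonize_resource_vcf() {
  in_vcf=$1; ref_dict=$2; intervals=$3; prefix=$4
  "$GATK" SelectVariants -V "$in_vcf" -L "$intervals" \
    -O "${prefix}.primary.vcf.gz" &&
  "$GATK" UpdateVCFSequenceDictionary -V "${prefix}.primary.vcf.gz" \
    --source-dictionary "$ref_dict" --replace true \
    -O "${prefix}.primary.reheadered.vcf.gz"
}

# Example invocation (file names are placeholders):
#   harmonize_resource_vcf af-only-gnomad.hg38.vcf.gz GRCh38.p14.dict primary_contigs.bed gnomad
```

This only helps when the contig names already match the reference (for example `chr1`); a resource that uses different names would need its contigs renamed first.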
I hope this helps.
-
Dear Gökalp Çelik,
Thank you so much for your reply and information. I understand the header adjustment issue, because the new dbSNP and GRCh38 use different headers.
The af-only file you provided uses gnomAD 2. I am just exploring the outcome of using gnomAD 4.1 and comparing it with gnomAD 2. I have been using other files from the best practices, like the indels VCF and the PoN.
And thank you for the suggestion of removing the non-applicable contigs.
Hope you have a nice week. :D