Force genotype with specific reference-allele combinations
Hello,
I am trying to use PharmCAT to produce pharmacogenomic calls (https://pharmcat.org/using/VCF-Requirements/). PharmCAT requires a genotype for all of its pharmacovariants. This becomes problematic when an insertion or deletion is not present in the VCF because it is represented by the reference call at each position rather than the indel which is not present. It is discussed more here (https://github.com/PharmGKB/PharmCAT/issues/128).
For example
Here is the PharmCAT position VCF
#CHROM POS ID REF ALT QUAL FILTER
chrX 154536021 . CAGA C . PASS
Here is the single sample gvcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chrX 154536021 . C . . . DP=59 GT:DP:RGQ 0/0:59:99
chrX 154536022 . A . . . DP=59 GT:DP:RGQ 0/0:59:99
chrX 154536023 . G . . . DP=59 GT:DP:RGQ 0/0:59:99
chrX 154536024 . A . . . DP=59 GT:DP:RGQ 0/0:59:99
I would like to see something like this
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chrX 154536021 . CAGA C . PASS PX=G6PD GT:GQ:DP 0/0
Do you have any suggestions for modifying my GATK commands?
I have tried to joint genotype with the pharmcat_position.vcf which produces genotypes at some positions but not all.
Thanks so much for your consideration.
REQUIRED for all errors and issues:
- a) GATK version used:
4.6
- b) Exact command used:
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \
-L $pharmcat_git/pharmcat_positions.vcf \
-I $cram_path \
-O $gvcf_single_sample \
-ip 25 \
-ERC GVCF
$gatk_path --java-options "-Xmx4g" GenotypeGVCFs \
-R $reference_grch38 \
-V $gvcf_single_sample \
-O $gt_sample_vcf --include-non-variant-sites true
- c) Entire program log:
Using GATK jar /gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4G -jar gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar HaplotypeCaller -R ./Input/Homo_sapiens_assembly38.fasta -L /pharmcat_positions.vcf -I /sample.cram -O /sample.g.vcf -ip 25 -ERC GVCF
15:08:36.375 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:08:36.591 INFO HaplotypeCaller - ------------------------------------------------------------
15:08:36.594 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.6.0.0
15:08:36.595 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
15:08:36.595 INFO HaplotypeCaller - Executing as on Linux v5.10.102.1-microsoft-standard-WSL2 amd64
15:08:36.595 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v19.0.2+7-Ubuntu-0ubuntu322.04
15:08:36.596 INFO HaplotypeCaller - Start Date/Time: August 5, 2024 at 3:08:36 PM EDT
15:08:36.596 INFO HaplotypeCaller - ------------------------------------------------------------
15:08:36.596 INFO HaplotypeCaller - ------------------------------------------------------------
15:08:36.598 INFO HaplotypeCaller - HTSJDK Version: 4.1.1
15:08:36.598 INFO HaplotypeCaller - Picard Version: 3.2.0
15:08:36.598 INFO HaplotypeCaller - Built for Spark Version: 3.5.0
15:08:36.599 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:08:36.599 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:08:36.599 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:08:36.600 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:08:36.600 INFO HaplotypeCaller - Deflater: IntelDeflater
15:08:36.600 INFO HaplotypeCaller - Inflater: IntelInflater
15:08:36.601 INFO HaplotypeCaller - GCS max retries/reopens: 20
15:08:36.601 INFO HaplotypeCaller - Requester pays: disabled
15:08:36.602 INFO HaplotypeCaller - Initializing engine
15:08:37.173 INFO FeatureManager - Using codec VCFCodec to read file file:///pharmcat_positions.vcf
15:08:37.229 INFO IntervalArgumentCollection - Processing 24850 bp from intervals
15:08:37.239 INFO HaplotypeCaller - Done initializing engine
15:08:37.241 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
15:08:37.352 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
15:08:37.354 INFO NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
15:08:37.356 INFO SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
15:08:37.359 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
15:08:37.359 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
15:08:37.371 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
15:08:37.382 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
15:08:37.384 INFO IntelPairHmm - Available threads: 20
15:08:37.384 INFO IntelPairHmm - Requested threads: 4
15:08:37.384 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
15:08:37.447 INFO ProgressMeter - Starting traversal
15:08:37.448 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
15:08:39.961 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr1:97883329 and possibly subsequent; at least 10 samples must have called genotypes
15:08:47.553 INFO ProgressMeter - chr19:15885993 0.2 150 890.7
15:08:50.722 INFO HaplotypeCaller - 2169 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
3910 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
6079 total reads filtered out of 45565 reads processed
15:08:50.722 INFO ProgressMeter - chrX:154535151 0.2 326 1473.7
15:08:50.723 INFO ProgressMeter - Traversal complete. Processed 326 total regions in 0.2 minutes.
15:08:50.814 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0023517390000000003
15:08:50.814 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.076751521
15:08:50.815 INFO SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.07 sec
15:08:50.816 INFO HaplotypeCaller - Shutting down engine
[August 5, 2024 at 3:08:50 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.24 minutes.
Runtime.totalMemory()=664797184
done with variant calling
Using GATK jar /gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar /gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar GenotypeGVCFs -R ./Input/Homo_sapiens_assembly38.fasta -V /Sample.g.vcf -O /sample.single.gt.vcf --include-non-variant-sites true
15:08:54.414 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:08:54.644 INFO GenotypeGVCFs - ------------------------------------------------------------
15:08:54.647 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.6.0.0
15:08:54.648 INFO GenotypeGVCFs - For support and documentation go L005 on Linux v5.10.102.1-microsoft-standard-WSL2 amd64
15:08:54.648 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v19.0.2+7-Ubuntu-0ubuntu322.04
15:08:54.649 INFO GenotypeGVCFs - Start Date/Time: August 5, 2024 at 3:08:54 PM EDT
15:08:54.649 INFO GenotypeGVCFs - ------------------------------------------------------------
15:08:54.649 INFO GenotypeGVCFs - ------------------------------------------------------------
15:08:54.651 INFO GenotypeGVCFs - HTSJDK Version: 4.1.1
15:08:54.652 INFO GenotypeGVCFs - Picard Version: 3.2.0
15:08:54.652 INFO GenotypeGVCFs - Built for Spark Version: 3.5.0
15:08:54.653 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:08:54.653 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:08:54.654 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:08:54.654 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:08:54.654 INFO GenotypeGVCFs - Deflater: IntelDeflater
15:08:54.655 INFO GenotypeGVCFs - Inflater: IntelInflater
15:08:54.655 INFO GenotypeGVCFs - GCS max retries/reopens: 20
15:08:54.656 INFO GenotypeGVCFs - Requester pays: disabled
15:08:54.656 INFO GenotypeGVCFs - Initializing engine
15:08:54.964 INFO FeatureManager - Using codec VCFCodec to read file file:/sample.g.vcf
15:08:55.099 INFO GenotypeGVCFs - Done initializing engine
15:08:55.155 INFO ProgressMeter - Starting traversal
15:08:55.157 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
15:08:55.557 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr1:97883329 the annotation MLEAC=[2, 0] was not a numerical value and was ignored
15:08:55.608 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr1:97883329 and possibly subsequent; at least 10 samples must have called genotypes
15:08:56.512 INFO ProgressMeter - chrX:154534123 0.0 24850 1100369.0
15:08:56.513 INFO ProgressMeter - Traversal complete. Processed 24850 total variants in 0.0 minutes.
15:08:56.724 INFO GenotypeGVCFs - Shutting down engine
[August 5, 2024 at 3:08:56 PM EDT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=759169024
-
Can you try adding the parameter
--output-mode EMIT_ALL_ACTIVE_SITES
to the HaplotypeCaller command? It should be able to emit INDELs properly. If you also wish to capture MNPs found within the pharmcat positions file you should set
--max-mnp-distance 1
Keep in mind that this parameter may convert some of those SNPs to MNPs as well if they are in close proximity to another SNP so you need to do some post processing to eliminate unwanted MNPs from that data.
Finally. GVCF mode will generate additional need post processing so instead of using GVCF mode we recommend running HaplotypeCaller without emitting reference confidence blocks so you can omit -ERC parameter.
I hope this helps.
Regards.
-
Thank you so much for the suggestion but it unfortunately did not work! This command excluded almost all of the non-variant positions (I'm not sure why it included some of the non-variant positions) and it excluded most of the indels. Thank you so much for your help!
GATK 4.6
Command
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \
-L $pharmcat_git/pharmcat_positions.vcf \
-I $bam_path \
-O $vcf_single_sample \
-ip 25 \
--output-mode EMIT_ALL_ACTIVE_SITES
Chromosome 2 positions from $pharmcat_git/pharmcat_positions.vcf
chr2 233759924 rs887829 C T . PASS PX=UGT1A1 GT 0/0
chr2 233760233 rs3064744 CAT C,CATAT,CATATAT . PASS PX=UGT1A1 GT 0/0
chr2 233760498 rs4148323 G A . PASS PX=UGT1A1 GT 0/0
chr2 233760973 rs35350960 C A . PASS PX=UGT1A1 GT 0/0
Expected/desired chromosome 2 output for my sample
chr2 233759924 rs887829 C T . PASS PX=UGT1A1 GT 1/1
chr2 233760233 . C CAT 1109.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.866;DP=36;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=34.66;ReadPosRankSum=1.355;SOR=0.569 GT:AD:DP:GQ:PL 1/1:1,31:32:76:1123,76,0
chr2 233760233 rs3064744 CAT C . PASS PX=UGT1A1 GT 0/0
chr2 233760233 rs3064744 CAT CATAT . PASS PX=UGT1A1 GT 0/0
chr2 233760233 rs3064744 CAT CATATAT . PASS PX=UGT1A1 GT 0/0
chr2 233760498 rs4148323 G A . PASS PX=UGT1A1 GT 0/0
chr2 233760973 rs35350960 C A . PASS PX=UGT1A1 GT 0/0
output from $vcf_single_sample for chromosome 2
chr2 233759924 . C T 152.96 . AC=2;AF=1.00;AN=2;DP=5;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.59;SOR=1.981 GT:AD:DP:GQ:PL 1/1:0,5:5:15:167,15,0
chr2 233760233 . C CAT 1109.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.866;DP=36;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=34.66;ReadPosRankSum=1.355;SOR=0.569 GT:AD:DP:GQ:PL 1/1:1,31:32:76:1123,76,0
-
To add a few additional options I tried which gave similar output.
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \
-L $pharmcat_git/pharmcat_positions.vcf \
-I $bam_path \
-O $vcf_single_sample \
-ip 25 \
--output-mode EMIT_ALL_CONFIDENT_SITES
To add a few additional options I tried
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \
-L $pharmcat_git/pharmcat_positions.vcf \
-I $bam_path \
-O $vcf_single_sample \
-ip 25 \
--all-site-pls true \
--output-mode EMIT_ALL_CONFIDENT_SITES
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \
-L $pharmcat_git/pharmcat_positions.vcf \
-I $bam_path \
-O $vcf_single_sample \
-ip 25 \
--output-mode EMIT_ALL_ACTIVE_SITES
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \
-L $pharmcat_git/pharmcat_positions.vcf \
-I $bam_path \
-O $vcf_single_sample \
-ip 25 \
--all-site-pls true \
--output-mode EMIT_ALL_ACTIVE_SITES
I should also add that the CRAM is an exome and I would happily accept ./. as the genotype if its missing. I have updated the expected/desired below
Chromosome 2 positions from $pharmcat_git/pharmcat_positions.vcf
chr2 233759924 rs887829 C T . PASS PX=UGT1A1 GT 0/0
chr2 233760233 rs3064744 CAT C,CATAT,CATATAT . PASS PX=UGT1A1 GT 0/0
chr2 233760498 rs4148323 G A . PASS PX=UGT1A1 GT 0/0
chr2 233760973 rs35350960 C A . PASS PX=UGT1A1 GT 0/0
Expected/desired chromosome 2 output for my sample
chr2 233759924 rs887829 C T . PASS PX=UGT1A1 GT 1/1
chr2 233760233 . C CAT 1109.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.866;DP=36;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=34.66;ReadPosRankSum=1.355;SOR=0.569 GT:AD:DP:GQ:PL 1/1:1,31:32:76:1123,76,0
chr2 233760233 rs3064744 CAT C . PASS PX=UGT1A1 GT 0/0
chr2 233760233 rs3064744 CAT CATAT . PASS PX=UGT1A1 GT 0/0
chr2 233760233 rs3064744 CAT CATATAT . PASS PX=UGT1A1 GT 0/0
chr2 233760498 rs4148323 G A . PASS PX=UGT1A1 GT 0/0 or ./.
chr2 233760973 rs35350960 C A . PASS PX=UGT1A1 GT 0/0 or ./.
output from $vcf_single_sample for chromosome 2
chr2 233759924 . C T 152.96 . AC=2;AF=1.00;AN=2;DP=5;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.59;SOR=1.981 GT:AD:DP:GQ:PL 1/1:0,5:5:15:167,15,0
chr2 233760233 . C CAT 1109.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.866;DP=36;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=34.66;ReadPosRankSum=1.355;SOR=0.569 GT:AD:DP:GQ:PL 1/1:1,31:32:76:1123,76,0
-
Hi again.
Can you make sure that you are providing --alleles parameter to force call the genotype status of each of those sites ?
You need to provide the pharmcat vcf file as --alleles parameter as well.
-
The following worked beautifully
$gatk_path --java-options "-Xmx4G" HaplotypeCaller -R $reference_grch38 \--alleles $input_decomposed_vcf \-I $bam_path \-O $vcf_single_sample \-L $pharmcat_git/pharmcat_positions.vcf \-ip 25 \--output-mode EMIT_ALL_ACTIVE_SITESThere is a slight issue with MNVs which i will now investigate. Thank you so so much for your feedback. -
I should add that i used vt decompose for the pharmcat positions. I don't think it was necessary. Thanks.
-
Great to hear that the problem is solved. If you wish to add MNPs to your callset then you need to adjust the MNP distance parameter I mentioned above and do a cleanup and merge once you call once with MNPs and once without. Calling with MNP will include additional non-desired ref and alt alleles for certain positions.
Regards.
-
added the mnv option and now everything works! Thank you!!!
Please sign in to leave a comment.
8 comments