GATK not replacing variants in reference sequence
Background: My goal is to reference based reconstruct a group of loci for ~2k Pf20k samples. I've have identified the Malariagen Pf20k samples that I'm interested in working with and filtered these accordingly. For space sake I've taken the original VCF and cut it down to only include loci of interest and for this example reduced it to a single sample. I've also filtered out any variants that do not meet Pf20k QC i.e. PASS.
I've created a custom bed file that lists the coordinates of this locus that I want to reconstruct to be used with the -L option. Reference is the original Pf genome listed in the VCF header and .fai/.dict files were generated from this reference.
Problem: The output fasta of the reconstructions are all the same and match reference. It seems FastaAlternateReferenceMaker is not replacing the nucleotides with the appropriate variants. I've aligned two different output.fastas to each other and they are 100% identical. Furthermore, I took their two input VCFs and filtered them for only heterozygous positions and they are indeed not the same so the outputs should be different.
REQUIRED for all errors and issues:
a) GATK version used:
4.4.0.0 and 4.2.2.0 (4.2.2.0 was used to assess whether this was a version specific issue given other dependencies on the server)
b) Exact command used:
/usr/local/packages/gatk-4.2.2.0/gatk FastaAlternateReferenceMaker -R /local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_FASTA_genome/Pfalciparum.genome.fasta -O test_SPT_kemi_deomnstrate.fasta -L /local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_bed_files/PF3D7_0703900.bed -V /local/scratch/ryan.scalsky/test/SampleFiltered_Pf3D7_07_v3_PF3D7_0703900_SPT43376.recode_pass_only.recode.vcf
c) Entire program log:
[ryan.scalsky@thanos test]$ /usr/local/packages/gatk-4.2.2.0/gatk FastaAlternateReferenceMaker -R /local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_FASTA_genome/Pfalciparum.genome.fasta -O test_SPT_2.fasta -L /local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_bed_files/PF3D7_0703900.bed -V /local/scratch/ryan.scalsky/test/SampleFiltered_Pf3D7_07_v3_PF3D7_0703900_SPT43376.recode_pass_only.recode.vcf
Using GATK jar /usr/local/packages/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/local/packages/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar FastaAlternateReferenceMaker -R /local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_FASTA_genome/Pfalciparum.genome.fasta -O test_SPT_2.fasta -L /local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_bed_files/PF3D7_0703900.bed -V /local/scratch/ryan.scalsky/test/SampleFiltered_Pf3D7_07_v3_PF3D7_0703900_SPT43376.recode_pass_only.recode.vcf
09:20:57.994 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/packages/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:21:01.228 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
09:21:01.229 INFO FastaAlternateReferenceMaker - The Genome Analysis Toolkit (GATK) v4.2.2.0
09:21:01.229 INFO FastaAlternateReferenceMaker - For support and documentation go to https://software.broadinstitute.org/gatk/
09:21:01.231 INFO FastaAlternateReferenceMaker - Executing as ryan.scalsky@umaryland.edu on Linux v4.18.0-240.22.1.el8_3.x86_64 amd64
09:21:01.231 INFO FastaAlternateReferenceMaker - Java runtime: Java HotSpot(TM) 64-Bit Server VM v19+36-2238
09:21:01.232 INFO FastaAlternateReferenceMaker - Start Date/Time: March 26, 2024 at 9:20:57 AM EDT
09:21:01.232 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
09:21:01.232 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
09:21:01.233 INFO FastaAlternateReferenceMaker - HTSJDK Version: 2.24.1
09:21:01.233 INFO FastaAlternateReferenceMaker - Picard Version: 2.25.4
09:21:01.233 INFO FastaAlternateReferenceMaker - Built for Spark Version: 2.4.5
09:21:01.233 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:21:01.234 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:21:01.234 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:21:01.234 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:21:01.234 INFO FastaAlternateReferenceMaker - Deflater: IntelDeflater
09:21:01.234 INFO FastaAlternateReferenceMaker - Inflater: IntelInflater
09:21:01.234 INFO FastaAlternateReferenceMaker - GCS max retries/reopens: 20
09:21:01.234 INFO FastaAlternateReferenceMaker - Requester pays: disabled
09:21:01.235 INFO FastaAlternateReferenceMaker - Initializing engine
09:21:01.660 INFO FeatureManager - Using codec VCFCodec to read file file:///local/scratch/ryan.scalsky/test/SampleFiltered_Pf3D7_07_v3_PF3D7_0703900_SPT43376.recode_pass_only.recode.vcf
09:21:01.750 INFO FeatureManager - Using codec BEDCodec to read file file:///local/projects-t3/p_falciparum/ryan.scalsky/Thesis/MalariaGen_Samples/Reference_based_reconstruction/Reference_bed_files/PF3D7_0703900.bed
09:21:01.757 INFO IntervalArgumentCollection - Processing 13905 bp from intervals
09:21:01.764 INFO FastaAlternateReferenceMaker - Done initializing engine
09:21:01.794 INFO ProgressMeter - Starting traversal
09:21:01.795 INFO ProgressMeter - Current Locus Elapsed Minutes Bases Processed Bases/Minute
09:21:02.488 INFO ProgressMeter - Pf3D7_07_v3:166119 0.0 13905 1205635.8
09:21:02.488 INFO ProgressMeter - Traversal complete. Processed 13905 total bases in 0.0 minutes.
09:21:02.538 INFO FastaAlternateReferenceMaker - Shutting down engine
[March 26, 2024 at 9:21:02 AM EDT] org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker done. Elapsed time: 0.08 minutes.
Best,
Ryan
-
Hi Ryan Scalsky
Can you post the variants you are interested in to be included in the reference as they appear in the VCF file?
-
This run was for sample: SPT43376
This is the respective section of the VCF after filtering to only contain heterozygous variants.
##bcftools_viewCommand=view -i 'GT="0/1" || GT="1/0"' -Oz -o temp_SPT.vcf.gz SampleFiltered_Pf3D7_07_v3_PF3D7_0703900_SPT43376.recode_pass_only.recode.vcf; Date=Mon Mar 25 15:00:17 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPT43376
Pf3D7_07_v3 163428 . TTATTTA TTATTTATATTTA,TTATTTATATTTATATTTA,TTATTTATATTTATATTTATATTTA,T,TTATTTATATTTATATTTATATTTATATTTA,TTATTTATTTTTATATTTA 2.42157e+07 PASS
. GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:12,198,0,0,0,0,0:210:99:.:.:5989,0,187,6029,791,6819,6029,791,6819,6819,6029,791,6819,6819,6819,6029,791,6819,6819,6819,6819,6029,791,6819,6819,6819,6819,6819:.
Pf3D7_07_v3 164456 . ATTATTATTATTATTATTATTATT *,TTTATTATTATTATTATTATTATT,GTTATTATTATTATTATTATTATT 1.76884e+06 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS
0/1:4,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314:.
Pf3D7_07_v3 164457 . T *,C,TTAC 1.88546e+06 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314:.
Pf3D7_07_v3 164458 . TATTATTATTATTA *,TATCATTATTATTATTA,CATTATTATTATTA,TATTATTACTATCATTATTATTATTA,TACTATCATTATTATTATTA 1.60591e+06 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314,1299,1320,2314,2314,2314,1299,1320,2314,2314,2314,2314:.
Pf3D7_07_v3 164459 . ATTATT *,GTTATT 1.33498e+06 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0:69:99:.:.:2215,0,2140,1299,1320,2314:.
Pf3D7_07_v3 164460 . TTATTA *,CTATTA,TTACTATTA 1.49859e+06 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314:.
Pf3D7_07_v3 164461 . TATTATTA *,CATTATTA,TACTATCATTATTA,TATTACTATCATTATTA,TATTATTACTATCATTATTA,TATTATTATTACTATCATTATTA 1.52404e+06 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0,0,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314,1299,1320,2314,2314,2314,1299,1320,2314,2314,2314,2314,1299,1320,2314,2314,2314,2314,2314:.
Pf3D7_07_v3 164462 . ATTAT *,GTTAT 770480 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0:69:99:.:.:2215,0,2140,1299,1320,2314:.
Pf3D7_07_v3 164463 . T *,C 825886 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,0,0:69:99:.:.:2215,0,2140,1299,1320,2314:.
(END) -
Hi Ryan Scalsky
Your input VCF seems to contain multiple alleles and especially complex overlapping indels and symbolics ("*") which altogether will suppress any nucleotide substitution.
Here is the document for the tool
https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker
Here are points to consider when using this tool.
- If there are multiple variants that start at a site, it chooses one of them randomly.
- When there are overlapping indels (but with different start positions) only the first will be chosen.
- This tool works only for SNPs and for simple indels (but not for things like complex substitutions).
I hope this helps.
-
This is helpful but it doesn't explain why the SNPs, for example the last variant located at 164463 , are not being substituted in. I've ready the points you've listed in the documents for the tool but it isn't clear what defines a "complex substitution". Would we expect the tool to abort all function if any complex substitution is present or simply to ignore that variant?
-
The symbolics are probably reason why there is no substitution. Can you remove all those symbolics and just keep biallelic SNP and INDELS to see which ones get substituted when you run the tool with those?
-
Gökalp,
This makes sense. I imagine the tool cannot contribute missing or unknow variants, I see in this example all variants contain a symbolic. I will remove these then try again. Thank you!
Best,
Ryan
-
I reran the command with symbolics removed in one VCF and in the other I reran it with symbolics removed and only heterozygous sites included in the VCF. Both appear to have worked in that the sequences are not the same as the reference but they should be identical. For some reason they're not.
It appears that the symbolics were causing the script to abort. In reviewing the VCF with symbolics removed but not restricted to only variants that deviate from reference there appear to be insertions and deletions added to the sequence that aren't even in the VCF.
-
Adding additional information that may be relevant:
1. I reran this with a different sample so there was something to compare. When running GATK FastaAlternateReferenceMaker on the single sample vcf's containing only GT = 0/1 or 1/0 and no symbolics the output fasta sequence is the exact same, confirmed by multi alignment and by examining the .dict files but the M5 is different between both.
2. When comparing the output from the VCF with GT = 0/1 or 1/0 and no symoblics to the reference the variants that have been modified do not reflect the contents of the input VCF.
-
I think the contents are still reflecting what is inside the VCF however 2 rules are in effect. If there are multiple variants at a site a random one will be picked. Secondly INDELs are simplified by leftaligning and trimming before being applied to the reference fasta.
I am not be best to decide what would be the most optimal solution however I would be extra careful to integrate INDELs into a reference genome since all coordinates shift and annotations become incompatible however I would definitely only keep biallelic SNPs.
Please sign in to leave a comment.
9 comments