FastaAlternateReferenceMaker creating chimera
I’m using FastaAlternateReferenceMaker to obtain the sequence of the Alternate sample but after doing some testing, it seems that it is not doing what I expected this function to do (maybe my expectation is wrong!); it rather creates chimera sequences (even when a single sample is used).
I’ve mapped Illumina data from one single sample on a reference genome and after following the GATK best practices, I obtained a gvcf file (hence information about all sites, including those with DP=0 is included); the last commands are as follows (as expected, using “–ERC GVCF” for HaplotypeCaller results in a similar gvcf file in terms of information).
gatk HaplotypeCaller -ploidy 1 -I $In.bam -O Out.ERCBP.vcf.gz -R Ref.fas --native-pair-hmm-threads 40 --add-output-vcf-command-line true -ERC BP_RESOLUTION
gatk GenotypeGVCFs -R Ref.fas --include-non-variant-sites true -V Out.ERCBP.vcf.gz -O Out.ERCBP.Geno.vcf.gz
I then obtained the sequence from the alternate allele using FastaAlternateReferenceMaker:
gatk FastaAlternateReferenceMaker -R Ref.fas -O Out.ERCBP.Geno.fasta -V Out.ERCBP.Geno.vcf.gz
As far as I could understand, FastaAlternateReferenceMaker gives the sequence of the alternate allele. However, after some testing, it turns out this is not quite the case. When there is no mapping at some sites of the reference (i.e. DP=0), the FastaAlternateReferenceMaker function uses the Reference allele/sequence instead of reporting ‘N’ or ‘-‘ or any other value indicating ‘no data’. I tested this for small DNA fragments (10, 100 or 1000 bp) and large contigs (>2 Mb), and obtained the same behaviour of the function irrespective of fragment length with no mapping (absence of mapping checked on the gvcf). As a result, unless 100% of the sites of the reference sequences are covered by the mapped samples, which is uncommon, the FastaAlternateReferenceMaker function provides a chimera sequence between the Reference and Alternate. I could not find any argument to address this ‘no data’ issue.
I’m using gatk 220.127.116.11 & Java 17.0.5 but it appears that the behaviour of the function is similar across versions. If this behaviour is expected, maybe it could be useful to list it in the ‘Caveats’ section in the Tool Index. In that case, there does not seem to be much possibility to exclude regions with no mapping, unless maybe one uses the –intervals argument?
Please, let me know is this is an issue with the function or just the wrong interpretation of what it does?
I'm sorry to say that the tool just doesn't do what you're hoping it does. It sounds like what you want is to output only the reference sequence that is supported directly by your sequencing data, and substitute N everywhere else. The tool doesn't do that. It just substitutes the differences into existing reference and leaves anything unspecified as is.
It would be possible if you have a g.vcf which includes information about the reference confidence at non-variant sites. I don't think a regular vcf has enough information to decide where you would want Ns. If this is a feature you would like supported you could open a feature request ticket on the gatk github. We're a bit stretched thin right now so I can't promise it will happen, but it's probably not a huge lift to implement so maybe someone would have a chance to do so.
Please sign in to leave a comment.