HaplotypeCaller confuses introns as long deletions
I'm using GATK v4.1.9.0. When I run HaplotypeCaller on RNA-seq Illumina short reads I get several false-positive (FP) long deletions. Checking these FPs in IGV, I noticed that HaplotypeCaller is confusing introns as long deletions. I don't think it's a mapping issue, because when I use PacBio long reads they confirm the introns and no long deletions are called.
After aligning reads to the genome with STAR, adding read group with picard AddOrReplaceReadGroups, removing introns with SplitNCigarReads and calculating/applying base recalibration, I ran HaplotypeCaller as shown in the code below:
### Variant calling using GATK HaplotypeCaller with GVCF mode
mkdir ${OUTPUT_DIR}/intermediate_gvcfs
mkdir ${OUTPUT_DIR}/hc_bam
for i in `seq -f '%04g' 0 $loop_num`
do
gatk --java-options "-Xmx4G" HaplotypeCaller \
-R ${REF} \
-I ${OUTPUT_DIR}/base_recalibration_bams/${SAMPLE}_recal_$i.bam \
-O ${OUTPUT_DIR}/intermediate_gvcfs/${SAMPLE}_recal_$i.g.vcf \
-L ${SCATTERED_INTERVAL_LIST}/$i-scattered.interval_list \
--native-pair-hmm-threads 1 \
-ERC GVCF \
--bam-output ${OUTPUT_DIR}/hc_bam/hc.bam &
done
wait
Here is the log after finishing running this code:
I used the argument --bam-output to follow the instructions
"Try having HC generate the output bam (see tech doc for argument details) as that will show you what it thinks the region looks like after reassembly."
from https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/methods-and-algorithms/3891-calling-variants-in-rnaseq, where I believe this same issue was reported for an older version.
I don't know exactly what the BAM generated by HaplotypeCaller with --bam-output is, but looking at it in IGV we can see that the intron was converted into long deletion. See IGV screenshot below, where tracks are, from top to bottom:
- VCF file generated by GATK from Illumina short reads;
- BAM file with the Illumina short-read alignments (before using SplitNCigarReads to let us see introns);
- BAM file output by HaplotypeCaller when using argument --bam-output.
I would like to ask you, please, what would be the recommended pipeline to fix this problem?
I appreciate your quick response. Thanks in advance.
-
Hi Vladimir Souza,
Thank you for writing to the GATK forum! We greatly appreciate your patience while we ran some diagnostics.
After discussing your inquiry with our developers, I have some feedback and the next steps for you.
Could you please clarify whether you are including any introns in your intervals? Do you have any padding in your intervals?
You could try running a functional annotation tool after the fact to identify the intronic regions and filter them out. Of course, you can use any functional annotation tool, but we have a GATK tool called Funcotator that can do this for you.
I hope this helps! Please let me know if this leads you to success. Also, if you have any other questions, please do not hesitate to reach out.
Best,
Anthony
-
Hi Anthony Dias-Ciarla,
Thank you very much for responding.Here are some questions and what I have done so far.
Could you please clarify whether you are including any introns in your intervals?
Sorry, but I don't know exactly how to answer this. Do you mean the intervals specified in
-L ${SCATTERED_INTERVAL_LIST}/$i-scattered.interval_list
If so, they were created by the code
gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=$THREADS" ScatterIntervalsByNs \
-R $REF \
-O $ref_interval_dir/ref.interval_list
gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=$THREADS" SplitIntervals \
-R $REF \
-L $ref_interval_dir/ref.interval_list \
--scatter-count $THREADS \
-O $ref_interval_dir/ref.scattered.interval_listDo you have any padding in your intervals?
Do you mean the argument -ip? If so, I used the default value (0).
As you suggested, I tried Funcotator. But several true variants (I have a ground truth) were filtered out, leading to an extremely low recall. Here is the code that I used:
### Download pre-packaged data source
cd ${DATA_SOURCE_DIR}
gatk --java-options "-Xmx4G" FuncotatorDataSourceDownloader \
--germline \
--validate-integrity \
--extract-after-download
### Running Funcotator with base options
gatk --java-options "-Xmx4G" Funcotator \
--variant ${INPUT_VCF} \
--reference ${REF_FASTA} \
--ref-version hg38 \
--data-sources-path ${DATA_SOURCE_DIR}/funcotator_dataSources.v1.7.20200521g \
--output ${OUTPUT_VCF} \
--output-file-format VCFwhere ${INPUT_VCF} is the VCF file after processing HaplotypeCaller's output (GVCF mode) with GenotypeGVCFs, GatherVcfs, indel/SNP quality score recalibration (VariantRecalibrator/ApplyVQSR), and filtering out non-PASS variants (bcftools).
Please, let me know if you have any feedback.
Best regards.
-
Hi,
I am having the same issue. Have you figured it out ?
Thanks
-
Hi
Have you tried disabling the use of softclipped bases in HaplotypeCaller using the option? This option is supposed to be activated since SplitNCigarReads command splits reads with N Cigar and generates supplementary alignments that match the continuation of the actual RNA alignment.
--dont-use-soft-clipped-bases
Please sign in to leave a comment.
4 comments