Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

HaplotypeCaller confuses introns as long deletions

4 comments

  • Anthony DiCi

    Hi Vladimir Souza,

    Thank you for writing to the GATK forum! We greatly appreciate your patience while we ran some diagnostics.

    After discussing your inquiry with our developers, I have some feedback and the next steps for you.

    Could you please clarify whether you are including any introns in your intervals? Do you have any padding in your intervals?

    You could try running a functional annotation tool after the fact to identify the intronic regions and filter them out. Of course, you can use any functional annotation tool, but we have a GATK tool called Funcotator that can do this for you.
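A minimal sketch of the "filter after the fact" idea, assuming the annotated VCF carries Funcotator's FUNCOTATION entry in the INFO column and that intronic calls are labelled with the pipe-delimited classification INTRON. The function name `drop_intronic` and the file names are hypothetical, and the exact annotation layout should be checked against your own output before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: drop VCF records whose INFO column (column 8) contains the
# pipe-delimited token INTRON; header lines (starting with '#') are kept.
# Assumption: intronic calls are annotated as "...|INTRON|..." in INFO.
drop_intronic() {
  awk -F'\t' '/^#/ { print; next } $8 !~ /[|]INTRON[|]/ { print }'
}

# Usage (placeholder file names):
#   drop_intronic < funcotated.vcf > funcotated.no_introns.vcf
```

Note that this removes a record if any of its annotations is intronic, so a variant that is intronic in one transcript but coding in another would also be dropped.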

    I hope this helps! Please let me know if this leads you to success. Also, if you have any other questions, please do not hesitate to reach out.

    Best,

    Anthony

  • Vladimir Souza

    Hi Anthony Dias-Ciarla,

    Thank you very much for responding.

    Here are my answers to your questions, along with what I have tried so far.

    "Could you please clarify whether you are including any introns in your intervals?"

    Sorry, but I am not sure how to answer this. Do you mean the intervals specified in the following argument?

    -L ${SCATTERED_INTERVAL_LIST}/$i-scattered.interval_list

    If so, they were created with the following commands:

    gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=$THREADS" ScatterIntervalsByNs \
      -R $REF \
      -O $ref_interval_dir/ref.interval_list

    gatk --java-options "-Xmx4G -XX:+UseParallelGC -XX:ParallelGCThreads=$THREADS" SplitIntervals \
      -R $REF \
      -L $ref_interval_dir/ref.interval_list \
      --scatter-count $THREADS \
      -O $ref_interval_dir/ref.scattered.interval_list

    "Do you have any padding in your intervals?"

    Do you mean the -ip (--interval-padding) argument? If so, I used the default value (0).
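One rough way to approach the "are introns included?" question is to total the bases covered by the interval list passed to -L. Intervals produced by splitting the reference at runs of Ns would be expected to span nearly the whole reference, introns included. The sketch below assumes a Picard-style interval_list (header lines starting with `@`, then tab-separated contig, 1-based inclusive start and end); the function name `interval_list_span` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: sum the number of bases covered by an interval_list file.
# Assumptions: header lines start with '@'; data lines are tab-separated
# with contig, start, end in columns 1-3; coordinates are 1-based, inclusive.
interval_list_span() {
  awk -F'\t' '!/^@/ && NF >= 3 { total += $3 - $2 + 1 }
              END { printf "%.0f\n", total }' "$1"
}

# Usage (path taken from the commands above):
#   interval_list_span "$ref_interval_dir/ref.interval_list"
```

If the total is close to the size of the reference genome, the intervals are effectively genome-wide rather than restricted to exons.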

     

    As you suggested, I tried Funcotator. However, several true variants (I have a ground-truth set) were filtered out, which led to extremely low recall. Here is the code that I used:

    ### Download pre-packaged data source
    cd ${DATA_SOURCE_DIR}
    gatk --java-options "-Xmx4G" FuncotatorDataSourceDownloader \
      --germline \
      --validate-integrity \
      --extract-after-download

    ### Running Funcotator with base options
    gatk --java-options "-Xmx4G" Funcotator \
      --variant ${INPUT_VCF} \
      --reference ${REF_FASTA} \
      --ref-version hg38 \
      --data-sources-path ${DATA_SOURCE_DIR}/funcotator_dataSources.v1.7.20200521g \
      --output ${OUTPUT_VCF} \
      --output-file-format VCF

    where ${INPUT_VCF} is the VCF file after processing HaplotypeCaller's output (GVCF mode) with GenotypeGVCFs, GatherVcfs, indel/SNP quality score recalibration (VariantRecalibrator/ApplyVQSR), and filtering out non-PASS variants (bcftools).
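For readers who want to quantify the recall drop described above, here is a minimal sketch that compares a call set against a ground-truth VCF by exact CHROM:POS:REF:ALT match. The function names `vcf_keys` and `recall` are hypothetical, both inputs are assumed to be uncompressed VCFs, and no normalization or multi-allelic splitting is performed, so differently represented indels will count as misses.

```shell
#!/usr/bin/env bash
# Sketch: exact-match recall of a call set against a truth set.

# Print one sorted, unique CHROM:POS:REF:ALT key per non-header VCF record.
vcf_keys() {
  awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# recall = (truth sites also present in the calls) / (truth sites)
# Arguments: $1 = truth VCF, $2 = called VCF.
recall() {
  local truth_keys call_keys tp total
  truth_keys="$(mktemp)"
  call_keys="$(mktemp)"
  vcf_keys "$1" > "$truth_keys"
  vcf_keys "$2" > "$call_keys"
  tp="$(LC_ALL=C comm -12 "$truth_keys" "$call_keys" | wc -l)"
  total="$(wc -l < "$truth_keys")"
  rm -f "$truth_keys" "$call_keys"
  awk -v tp="$tp" -v total="$total" \
    'BEGIN { if (total == 0) print "NA"; else printf "%.4f\n", tp / total }'
}

# Usage (placeholder file names):
#   recall truth.vcf calls.before_filter.vcf
#   recall truth.vcf calls.after_filter.vcf
```

Running it on the VCF before and after the intron filter shows how many true variants the filter removes.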

     

    Please let me know if you have any feedback.

    Best regards. 

  • Gianfilippo Coppola

    Hi,

    I am having the same issue. Have you figured it out?

    Thanks

  • SkyWarrior

    Hi,

    Have you tried disabling the use of soft-clipped bases in HaplotypeCaller with the option below? This option is supposed to be enabled for RNA-seq data, because SplitNCigarReads splits reads that contain N in their CIGAR string and generates supplementary alignments that match the continuation of the actual RNA alignment.

    --dont-use-soft-clipped-bases
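
To show where the flag would sit in a call, here is a dry-run sketch that only prints a HaplotypeCaller command and does not execute it. The helper `print_hc_command` and its four positional arguments (reference, BAM, interval list, output GVCF) are hypothetical placeholders; GVCF mode is assumed because that is what the original poster describes. Paths containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) a HaplotypeCaller command line that ignores
# soft-clipped bases. All four arguments are placeholder paths.
print_hc_command() {
  local ref="$1" bam="$2" intervals="$3" out="$4"
  # Each argument below is printed followed by a line-continuation backslash.
  printf '%s \\\n' \
    "gatk HaplotypeCaller" \
    "  -R ${ref}" \
    "  -I ${bam}" \
    "  -L ${intervals}" \
    "  -ERC GVCF" \
    "  --dont-use-soft-clipped-bases"
  printf '  -O %s\n' "${out}"
}

# Usage (placeholder paths):
#   print_hc_command ref.fasta sample.split.bam 0000-scattered.interval_list sample.g.vcf.gz
```

Printing the command first makes it easy to review the full argument list before launching a long run.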
