Genome Analysis Toolkit

Variant calling with PacBio HiFi reads

4 comments

  • WimS

    Genevieve Brandt

    I am also interested in whether PacBio HiFi reads are already supported in GATK and, if so, which parameters should be used.

    PacBio recommends either GATK4 or Google DeepVariant, so PacBio HiFi BAMs are probably already accepted by GATK4.

    https://www.pacb.com/wp-content/uploads/Application-Brief-Variant-detection-using-whole-genome-sequencing-with-HiFi-reads-Best-Practices.pdf

    From this document I also gather that GATK4 does not yet model PacBio-specific homopolymer sequencing errors, but simply treats the reads as long Illumina reads. Is that correct? Is there a plan to model these PacBio-specific homopolymer errors?

    Google DeepVariant currently seems to be better at modeling the homopolymer sequencing errors, which results in better indel calling.

    SNPs should not be affected much by this, unless a homopolymer sequence is close by?

    I also wonder whether you have already tried a joint analysis (i.e. GenomicsDBImport, GenotypeGVCFs) on a combined set of PacBio HiFi and Illumina sequencing data.

    Thank you for the information.

  • WimS

    I was able to run a combined set of public Illumina and PacBio HiFi samples through a standard bcbio alignment and variant calling analysis. BWA-MEM was used for alignment, and GATK4 for creating and merging the GVCF files.

    I did not change any of the parameters; all of bcbio's default parameters for analyzing Illumina data were used.

    https://github.com/bcbio/bcbio-nextgen/issues/3282#issuecomment-683669519
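
For readers who want to try a similar run outside bcbio, the sketch below assembles the three GATK4 steps of a GVCF-based joint call: per-sample HaplotypeCaller in GVCF mode, GenomicsDBImport, and GenotypeGVCFs. It is an illustration only and not the command lines bcbio actually issues (bcbio may merge GVCFs differently). The function name, sample names, file names, and the `chr20` interval are hypothetical, and the commands are only built and printed, not executed.

```python
import shlex
from typing import Dict, List


def joint_calling_commands(
    reference: str,
    sample_bams: Dict[str, str],
    interval: str,
    workspace: str = "joint_db",
    output_vcf: str = "joint.vcf.gz",
) -> List[List[str]]:
    """Build (but do not run) GATK4 commands for a GVCF-based joint call.

    All paths and names are placeholders supplied by the caller.
    """
    commands: List[List[str]] = []
    gvcfs: List[str] = []

    # Step 1: one GVCF per sample, whatever the sequencing platform.
    for sample, bam in sorted(sample_bams.items()):
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-O", gvcf,
            "-L", interval, "-ERC", "GVCF",
        ])

    # Step 2: consolidate the per-sample GVCFs into a GenomicsDB workspace.
    import_cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    commands.append(import_cmd)

    # Step 3: joint genotyping across all imported samples.
    commands.append([
        "gatk", "GenotypeGVCFs",
        "-R", reference, "-V", f"gendb://{workspace}", "-O", output_vcf,
    ])
    return commands


if __name__ == "__main__":
    # Hypothetical inputs: one HiFi sample and one Illumina sample.
    example = joint_calling_commands(
        "ref.fa",
        {"hifi1": "hifi1.bam", "ilmn1": "ilmn1.bam"},
        interval="chr20",
    )
    for cmd in example:
        print(shlex.join(cmd))  # shlex.join needs Python 3.8+
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.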

     

    The resulting BAM and multi-sample VCF file looked okay to me.

    There are false-positive variants/genotypes for the PacBio HiFi samples. In some regions, variants/genotypes are also missing for the HiFi samples even though there are reads in the HiFi BAM files. This might be caused by a bug in bcbio or GATK in combination with the HiFi BAM files.

    I did not find public 'truth' variant data for the public samples that I used.

    It would be good to test the bcbio pipeline and GATK software on HiFi data and then compare against a 'truth' variant data set. Genevieve Brandt, it would be interesting to know whether you have already tested HiFi variant calling and compared and optimized it against 'truth' data.
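
As a rough illustration of such a comparison, the sketch below counts true positives, false positives, and false negatives by exact match on (chromosome, position, REF, ALT) and derives precision, recall, and F1. It is a deliberately naive stand-in: it ignores genotypes and FILTER status and does not normalize variant representation, which matters for indels in homopolymers. Dedicated comparison tools such as hap.py or RTG vcfeval handle those cases. All function names here are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect one (chrom, pos, ref, alt) key per ALT allele from VCF text lines."""
    variants: Set[Variant] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            # Skip missing, spanning-deletion, and GVCF placeholder alleles.
            if alt not in (".", "*", "<NON_REF>"):
                variants.add((chrom, pos, ref, alt))
    return variants


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Exact-match comparison of a call set against a truth set."""
    tp = len(calls & truth)   # called and in truth
    fp = len(calls - truth)   # called but not in truth
    fn = len(truth - calls)   # in truth but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

In practice the two inputs would be restricted to the same confident regions before comparing, since calls outside the truth set's regions would otherwise all count as false positives.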

     

     

  • Genevieve Brandt

    I spoke to my team about this and we don't have any specific tests or comparisons to share, but we do have some insights. HiFi data should yield good variant calls because of its high quality. Most of the underlying assumptions that GATK makes should not hinder your results, as long as the data is of good quality. However, you will definitely want to fine-tune the parameters and test your results. After variant calling, make sure to filter your results using VariantFiltration or another tool, as the output from HaplotypeCaller is not meant to be the final result and can contain many false positives.
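
A minimal sketch of that filtering step, assuming a VCF that has already been through HaplotypeCaller and genotyping. It builds a `gatk VariantFiltration` command from name/expression pairs. The three thresholds are placeholders in the style of GATK's generic short-read hard-filtering guidance for SNPs; they have not been validated on HiFi data and would need tuning against a truth set. The function name and file names are hypothetical.

```python
from typing import Dict, List

# Placeholder SNP filters (assumption: short-read style starting points,
# not tuned for PacBio HiFi). Keys are filter names, values are expressions.
EXAMPLE_SNP_FILTERS: Dict[str, str] = {
    "QD2": "QD < 2.0",
    "FS60": "FS > 60.0",
    "MQ40": "MQ < 40.0",
}


def variant_filtration_command(
    reference: str,
    input_vcf: str,
    output_vcf: str,
    filters: Dict[str, str],
) -> List[str]:
    """Build (but do not run) a gatk VariantFiltration command line."""
    cmd = [
        "gatk", "VariantFiltration",
        "-R", reference, "-V", input_vcf, "-O", output_vcf,
    ]
    # Each expression is paired with the filter name written to FILTER.
    for name, expression in filters.items():
        cmd += ["--filter-expression", expression, "--filter-name", name]
    return cmd


if __name__ == "__main__":
    print(variant_filtration_command(
        "ref.fa", "raw.vcf.gz", "filtered.vcf.gz", EXAMPLE_SNP_FILTERS))
```

VariantFiltration annotates the FILTER column of records that match an expression instead of removing them, so the thresholds can be revisited later without re-calling.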

    minimap2 is designed specifically for long-read alignment and should give better results than BWA-MEM. Here are the parameters we would recommend for aligning long-read data: minimap2 -ayYL --MD --eqx -x asm20
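
To make the flags concrete, the sketch below spells out a full command line around them, adding the pieces the one-liner leaves out: reference, reads, thread count, and a read group (which GATK tools expect in the BAM header). The flag descriptions in the comments follow the minimap2 manual. The function names, file names, and read-group values are hypothetical, and the commands are only built, not run.

```python
from typing import List, Optional


def minimap2_command(
    reference: str,
    reads: str,
    read_group: Optional[str] = None,
    threads: int = 4,
) -> List[str]:
    """Build a minimap2 command using the flags recommended in the thread."""
    cmd = [
        "minimap2",
        "-ayYL",        # -a SAM output, -y copy FASTA/FASTQ comments,
                        # -Y soft-clip supplementary alignments,
                        # -L move very long CIGARs to the CG tag
        "--MD",         # emit the MD tag
        "--eqx",        # use =/X CIGAR operators for match/mismatch
        "-x", "asm20",  # preset recommended in the thread
        "-t", str(threads),
    ]
    if read_group is not None:
        # minimap2 accepts a literal backslash-t between @RG fields.
        cmd += ["-R", read_group]
    cmd += [reference, reads]
    return cmd


def sort_command(output_bam: str) -> List[str]:
    """Build a samtools command that sorts SAM from stdin into a BAM file."""
    return ["samtools", "sort", "-o", output_bam, "-"]


if __name__ == "__main__":
    align = minimap2_command(
        "ref.fa", "hifi.fastq.gz",
        read_group=r"@RG\tID:hifi1\tSM:hifi1\tPL:PACBIO",
        threads=8,
    )
    print(align)
    print(sort_command("hifi1.sorted.bam"))
```

minimap2 writes SAM to standard output, so the first command is typically piped into the second.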

    The last resource we have available is the set of long-read WDLs that our team has been working on. These are not featured workspaces, so we do not provide user support for them, but you may want to look at them while building your pipelines.

  • WimS

    Thank you, Genevieve Brandt and the GATK team, for the information.

    I am not sure which arguments to tweak, but switching to minimap2 and finding some truth data to compare against is a good place to start.

    We use the basic soft-filtering expressions in bcbio, which work for all species without requiring training/truth data. For our purposes they work well enough, at least with Illumina data. A lot (but certainly not all) of the HiFi false-positive variants that I identified by eye were also soft-filtered by these expressions.

    https://github.com/bcbio/bcbio-nextgen/blob/6e009c5e3e0d64b180a42ab15ab411bee7cea065/bcbio/variation/vfilter.py#L183

    I am already pleasantly surprised that HiFi variant calling more or less works out of the box in GATK4, and I look forward to more and improved long-read functionality in GATK.

    Thank you.
