Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

ApplyBQSR


  • David Roazen

    Hi,

    Could you please clarify whether your recalibrated output file (Abi1_recal.bam) is missing any reads that were present in the input file (Abi1_dedup.bam)? Are you saying that there are MT-aligned reads present in the input bam that are not showing up in the recalibrated output bam?
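
    One way to answer that question is to compare per-contig read counts before and after recalibration. The sketch below assumes samtools is installed and both BAM files are indexed; the helper name compare_idxstats and the .idxstats file names are made up for illustration. It only compares two saved `samtools idxstats` reports.

```shell
# Compare per-contig mapped-read counts from two `samtools idxstats` reports.
# idxstats columns: contig name, contig length, mapped reads, unmapped reads.
# Prints "contig<TAB>before<TAB>after" for every contig whose count differs.
compare_idxstats() {
    awk -F'\t' '
        NR == FNR { mapped[$1] = $3; next }   # first report: remember counts
        {
            if (mapped[$1] != $3)
                printf "%s\t%s\t%s\n", $1, mapped[$1], $3
            seen[$1] = 1
        }
        END {
            # contigs present in the first report but absent from the second
            for (c in mapped)
                if (!(c in seen))
                    printf "%s\t%s\tmissing\n", c, mapped[c]
        }
    ' "$1" "$2"
}

# Usage (BAM names from this thread):
#   samtools idxstats Abi1_dedup.bam > dedup.idxstats
#   samtools idxstats Abi1_recal.bam > recal.idxstats
#   compare_idxstats dedup.idxstats recal.idxstats
```

    No output means every contig, including MT, carries the same number of mapped reads in both files.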

    You might also try running with the additional options --use-jdk-inflater and --use-jdk-deflater and see if that makes a difference.
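
    For reference, a sketch of what such an invocation could look like. The reference and recalibration-table names are placeholders, since the original ApplyBQSR command is not shown in this thread. The function only prints the command so it can be reviewed before running.

```shell
# Print an ApplyBQSR command that uses the JDK inflater/deflater.
# Arguments: reference FASTA, input BAM, recalibration table, output BAM.
print_apply_bqsr_job() {
    printf 'gatk ApplyBQSR -R %s -I %s --bqsr-recal-file %s -O %s --use-jdk-inflater --use-jdk-deflater\n' \
        "$1" "$2" "$3" "$4"
}

# reference.fasta and Abi1_recal.table are placeholder names.
print_apply_bqsr_job reference.fasta Abi1_dedup.bam Abi1_recal.table Abi1_recal.bam
```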

    Regards,

    David

  • Wondessen Ayalew

    Dear David,
    Thank you for your kind reply. When I check the recalibrated BAM file, MT is available.

  • Wondessen Ayalew

    When I ran HaplotypeCaller, it took more than 6 days to generate the g.vcf file for one sample. In the meantime, I have decided to call only the MT variants first and continue with the nuclear variants later.
    Could you please support me in this regard?
    Thank you once again for your remarkable support.

  • Atal Saha

    Hi Wondessen Ayalew,

    Did you get any support on this? I am having the same problem - it took a week to finish generating g.vcf for 1 sample. The command I used:

    /gpfs/gpfs0/software/rhel7/eucleia/gatk-4.2.6.1/gatk --java-options "-Xmx260G" HaplotypeCaller -R /gpfs/gpfs0/scratch/Ecogenome/monkfish/reference_genome_monkfish/chr_level_assembly/LOPHIUS_GENOME_and_ANNOTATION/bf2_chromosomelevel.masked.fasta  -I  /gpfs/gpfs0/scratch/Ecogenome/reSultS/bam/Sample_15-LOP-095.dedup.fixed.bam -O /gpfs/gpfs0/scratch/Ecogenome/reSultS/hapcaller/monkfish/Sample_15-LOP-095.g.vcf > Sample_15-LOP-095.log

    So I am surprised that even with 260 GB of memory it is so slow! I would be grateful if anyone has a suggestion that can help speed up the process.

     

    thanks,

    Atal

  • David Roazen

    Hi Atal Saha / Wondessen Ayalew,

    The most common way to speed up HaplotypeCaller is to parallelize by genomic interval using the -L option, either in a local cluster or on the cloud, and then combine the outputs using MergeVcfs or CombineGVCFs. The basic idea is to run HaplotypeCaller many times in parallel, each with a different -L interval, and then merge the outputs at the end. Users typically will parallelize at least by chromosome, and often more finely. We publish a cloud-based workflow in Terra that can do this parallelization for you.
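
    A minimal sketch of the per-chromosome version of this idea, assuming the reference has a .fai index next to it and that per-contig outputs are named <sample>.<contig>.g.vcf.gz (a naming scheme invented for this example). The helpers only print commands: one HaplotypeCaller job per contig, to be submitted to a scheduler or run in parallel, plus one MergeVcfs job to run after they all finish.

```shell
# Print one HaplotypeCaller job per contig listed in <reference>.fai.
# Arguments: reference FASTA, input BAM, sample name.
print_scatter_jobs() {
    while read -r contig _rest; do
        printf 'gatk HaplotypeCaller -R %s -I %s -L %s -ERC GVCF -O %s.%s.g.vcf.gz\n' \
            "$1" "$2" "$contig" "$3" "$contig"
    done < "$1.fai"
}

# Print the MergeVcfs job that joins the per-contig GVCFs in reference order.
# Arguments: reference FASTA, sample name.
print_merge_job() {
    inputs=""
    while read -r contig _rest; do
        inputs="$inputs -I $2.$contig.g.vcf.gz"
    done < "$1.fai"
    printf 'gatk MergeVcfs%s -O %s.g.vcf.gz\n' "$inputs" "$2"
}

# Usage (placeholder file names):
#   print_scatter_jobs reference.fasta sample.dedup.bam sample > jobs.txt
#   print_merge_job reference.fasta sample
```

    The same -L option also restricts a single run to one contig, for example only the mitochondrial contig, whose name (MT, chrM, ...) depends on the reference. For assemblies with thousands of small scaffolds, one job per contig is likely too fine, and grouping contigs into interval lists would be the more practical split.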

    If you don't have access to a cluster, and don't want to run on the cloud, you can try running HaplotypeCallerSpark, which is able to parallelize HaplotypeCaller using multiple threads on a local machine.
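
    As a rough illustration, a local Spark run could be launched along these lines. The file names and the thread count are placeholders, and the arguments after the bare `--` follow the pattern shown in the GATK README for local Spark runs; whether every HaplotypeCaller argument is accepted by the Spark version should be checked against the tool documentation for your GATK release. The function prints the command rather than running it.

```shell
# Print a HaplotypeCallerSpark command for a local multi-threaded run.
# Arguments: reference FASTA, input BAM, output VCF, number of local threads.
print_spark_job() {
    printf "gatk HaplotypeCallerSpark -R %s -I %s -O %s -- --spark-runner LOCAL --spark-master 'local[%s]'\n" \
        "$1" "$2" "$3" "$4"
}

# Placeholder names; 8 threads is an arbitrary choice.
print_spark_job reference.fasta sample.dedup.bam sample.vcf.gz 8
```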

    One other thing you should check is whether you're running on an Intel/AMD CPU, or another architecture such as M1. GATK does not currently have good support for M1 chips, and tools like HaplotypeCaller will run very slowly on such machines.
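
    A quick way to check is `uname -m`. The small helper below, whose name is made up for this example, just labels the value it reports.

```shell
# Label the machine architecture string reported by `uname -m`.
classify_arch() {
    case "$1" in
        x86_64|amd64)  echo "x86-64 (Intel/AMD)" ;;
        arm64|aarch64) echo "ARM (for example Apple M1)" ;;
        *)             echo "other: $1" ;;
    esac
}

classify_arch "$(uname -m)"
```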

    Regards,
    David

  • Atal Saha

    Hi David,

    Thanks very much for your reply on this.

    Running the Spark version was much faster, but I am worried about the warning attached to this version (it says that we should not use Spark if we care about the results!). Are the results from Spark reliable? Also, Spark is not generating .idx files. Does Spark not generate an .idx file by default, as the normal HaplotypeCaller does?

    thanks for your help with this,

    Atal

  • David Roazen

    Hi Atal Saha,

    HaplotypeCallerSpark is a thin wrapper that just calls directly into the regular HaplotypeCaller code, so the results should be extremely close. However, because the Spark version "shards" the input data across multiple threads, there may be calling artifacts near the shard boundaries -- though a lot of work has been done to minimize this possibility. It's also possible that certain arguments that work for regular HaplotypeCaller may not work with HaplotypeCallerSpark. For these reasons, we hesitate to endorse the Spark version for clinical / production use, but for more casual purposes it should be perfectly fine to use. You may need to manually index the output VCF (e.g., using GATK's IndexFeatureFile) after running the Spark version of the tool.
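
    A small sketch of that indexing step, assuming the usual index names (.idx next to an uncompressed VCF, .tbi next to a bgzipped one). The helper names are invented for this example, and the function prints the IndexFeatureFile command only when no index is found.

```shell
# Succeed (exit 0) when no index file sits next to the given VCF.
needs_index() {
    case "$1" in
        *.gz) [ ! -e "$1.tbi" ] ;;
        *)    [ ! -e "$1.idx" ] ;;
    esac
}

# Print the indexing command only for VCFs that lack an index.
print_index_job_if_missing() {
    if needs_index "$1"; then
        printf 'gatk IndexFeatureFile -I %s\n' "$1"
    fi
}

# Usage (placeholder file name):
#   print_index_job_if_missing sample.vcf
```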

    What we actually do in production here at the Broad is to use an interval list with carefully-chosen split points at areas of the reference that are filled with N's (such as at the centromeres), and then launch many HaplotypeCaller tasks at once to call variants for these intervals. This approach eliminates the possibility of calling artifacts near the interval boundaries.
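
    For readers who want to approximate this on their own reference, one possible route is sketched below. It is not the Broad production pipeline. It assumes Picard's ScatterIntervalsByNs (bundled with GATK) to list the non-N stretches of the reference, followed by SplitIntervals to group them without cutting any interval. The N-run threshold, scatter count, and output names are arbitrary placeholders, and the function only prints the commands.

```shell
# Print commands that derive N-separated calling intervals and group them.
# Arguments: reference FASTA, max N-run length to merge across, scatter count.
print_interval_jobs() {
    printf 'gatk ScatterIntervalsByNs -R %s -OT ACGT -N %s -O calling_regions.interval_list\n' \
        "$1" "$2"
    printf 'gatk SplitIntervals -R %s -L calling_regions.interval_list --scatter-count %s --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION -O scattered_intervals\n' \
        "$1" "$3"
}

# Placeholder values: merge across N runs of up to 500 bp, 20 groups.
print_interval_jobs reference.fasta 500 20
```

    Each interval list written to the output directory can then be passed to its own HaplotypeCaller job via -L.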

    Regards,
    David

     

  • Atal Saha

    Thanks very much, David.

     

    I did manage to produce all my VCF files. Using 20-30 GB of memory seemed to be a good solution.

    However, I accidentally omitted the -ERC GVCF option when running HaplotypeCaller, and I am now struggling to combine the VCFs and run joint genotyping. Running HaplotypeCaller again for all my samples will take another 2 weeks, so is there a way out here?

     

    Thanks again,

    Atal

  • David Roazen

    Hi Atal Saha,

    Unfortunately, without the reference confidence scores produced by -ERC GVCF you will be unable to run joint genotyping using GATK. I'm afraid your only option is to re-call your samples.
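
    For anyone re-calling, the overall sequence looks roughly like the sketch below: per-sample HaplotypeCaller with -ERC GVCF, then CombineGVCFs, then GenotypeGVCFs. The sample names, the <sample>.bam input naming, and the cohort output names are placeholders, and the function only prints the commands.

```shell
# Print a joint-calling plan: per-sample GVCFs, one combined GVCF, final VCF.
# Arguments: reference FASTA, then one or more sample names.
print_joint_calling_plan() {
    ref="$1"
    shift
    variants=""
    for sample in "$@"; do
        printf 'gatk HaplotypeCaller -R %s -I %s.bam -O %s.g.vcf.gz -ERC GVCF\n' \
            "$ref" "$sample" "$sample"
        variants="$variants -V $sample.g.vcf.gz"
    done
    printf 'gatk CombineGVCFs -R %s%s -O cohort.g.vcf.gz\n' "$ref" "$variants"
    printf 'gatk GenotypeGVCFs -R %s -V cohort.g.vcf.gz -O cohort.vcf.gz\n' "$ref"
}

# Usage (placeholder names):
#   print_joint_calling_plan reference.fasta sampleA sampleB
```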

    Sorry!

    David

  • Atal Saha

    Thanks again, David.

    Re-calling started.

     

    cheers

    Atal

