Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs after GenomicsDBImport reports reference file errors on specific intervals where a chr name changes

0

7 comments

  • Avatar
    Gökalp Çelik

    Hi rpeery

    This is definitely a reference issue which needs to be fixed on your end.

    Also the problem is not quite clear to us. What do you mean by Fasta files named with patterns? Are you using only a single reference file or do you have multiple references?

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    rpeery

    Dear Gökalp,

    Thank you for your reply. I understand that the error is pointing to the reference. However, I am at a loss as to what the issue may be since I have replaced all line endings (and used dos2unix) and checked them with cat -v as well as other tools and looked for any errant "-" symbols with grep. They do not exist. I also do not have any non-IUPAC codes. I've remade the reference files after searching as well with CreateSequenceDictionary and samtools faidx commands. The gvcf files are fine - why does it break at the genotypeGVCF stage?

    My reference is a multifasta file of 4223 regions from exome capture. I am then using the L flag to run 100 regions at a time to speed up processing. This generates 43 intervals to process. The 42nd interval contains a naming convention change following the pattern that I described in my initial post. When that changes the script fails, but the first 95% of the interval (#42) is processed correctly as well as intervals 1-41, and the last interval (#43) is not processed at all.  This seems to point to a specific location within the reference to investigate. Therefore, I pulled these last 30 regions as a test and run that separately, meaning I made a fasta file, dict file, fai file, and bed file for the interval and that runs through genotypeGVCFs fine. So, it seems to be something about switching naming conventions in the multifasta file (???), which is very odd. I did not change anything about the formatting of the test files except that they only contain the 30 regions that are named differently using underscores.

    The easiest solution is probably to merge this test run with the previous intervals in the gather stage. I guess I'm looking for additional advice beyond what has been previously posted for looking at a reference to see where the error might be popping up and as a new user, I'd like to understand why it only occurs at this stage and not in generating the gvcf files.

    This is what one of the interval file looks like - again working fine for intervals 1-41 and 95% of interval 42:
    $ cat -vT ../refs/interval_batch_43.bed | head
    jcf7190000012653_16019_16332^I0^I313
    jcf7190000017161_425650_434705^I0^I9055
    jcf7190000020666_546893_547206^I0^I313
    jcf7190000022858_2311_2748^I0^I437
    jcf7190000025510_493657_518980^I0^I25323
    jcf7190000031468_367518_368362^I0^I844
    jcf7190000033566_323114_323244^I0^I130
    jcf7190000037928_60698_61011^I0^I313
    jcf7190000043848_135677_136666^I0^I989
    jcf7190000044955_68566_69128^I0^I562

    If I look for the offending "-" in the reference (error for interval 42) with this code and many other permutations of it including counting characters I do not see any issues:

    $ grep -v '>' ../refs/Dfir_twistTargets.fasta | tr -d "\n\r ATGCNatgcn"

    If I look at line endings (error for interval 43) everything looks fine:
    $ file Dfir_twistTargets.fasta
    Dfir_twistTargets.fasta: ASCII text, with very long lines (2952)

    $ cat -e Dfir_twistTargets.fasta | grep -A 1 jcf

    All the line endings are "$" as expected.

    Any suggestions and clarifications welcome!

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again. 

    Is it possible for you to post your fasta file and interval files in a single archive file using our issue submission ftp? We might want to take a look at that in detail.

    https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report

    Regards. 

     

    0
    Comment actions Permalink
  • Avatar
    rpeery

    Sorry for the delay - firewalls am I right?
    The files are uploaded as nameChangeUseCase.tar.gz. Thank you!

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi rpeery

    Looks like it was a simple fasta index issue. Once you modified your fasta file you should reindex and create new dictionary using samtools faidx/gatk IndexFeatureFile and samtools dict/gatk CreateSequenceDictionary tools. I was able to run your commands with a fresh dictionary and index without any issues. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    rpeery

    Thank you for your time looking into this. I had remade the dict and fai files several times, but the 7th time was the magic one (of course after reaching out). I am still curious why this flags at this stage (GenotypeGVCFs) and not earlier if there is an error in formatting for the reference files. Are the dict and fai files not accessed in the HaplotypeCaller stage?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    I would certainly check for VCFs for validation as there may be unforeseen issues hanging within. Our tools depend on index and dictionaries however the way they consume them may show differences. 

    I hope this helps.

    Regards. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk