Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

PathSeq error: java.lang.NumberFormatException

8 comments

  • Gökalp Çelik

    Hi Roshan Kumar

    What are the locale settings in your compute environment? Are you using a locale other than en_US?

  • Roshan Kumar

    We have the following settings on compute nodes: 

    $ locale

    LANG=en_US.UTF-8

  • Gökalp Çelik

    Hi again. 

    It looks like you may be hitting an edge case where the aligner emits a CIGAR string in a field where a number is expected. Can you provide sample data so that we can reproduce the error on our side?

    This article explains how to upload sample files for us to check:

    https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report

    Once you have uploaded the files, let us know in this topic and we will look into the issue in depth.

    Regards. 

  • Roshan Kumar

    I have uploaded a zip file with the data and the log file. As with the original version of PathSeq, I was expecting the input to be FASTQ files, but in gatk-pathseq the input file is a BAM. I therefore generated this unaligned BAM file (I want to keep both host and microbe reads). Is there a more direct way to generate the BAM file?

  • Gökalp Çelik

    Thank you, Roshan Kumar. We will certainly take a look shortly.

    Can you tell us the name of the file you uploaded? 

  • Roshan Kumar

    Pathseq-error.zip

  • Roshan Kumar

    I am eagerly waiting for your response.

  • Gökalp Çelik

    Hi

    I noticed that your BWA image file and microbe dictionary file are named differently. Are you sure that the dictionary file belongs to the pathseq_host.fa.img file?

    Secondly, when I checked the code that throws the error, it looks like one of your reads contains a malformed XA tag, so the parser picks up the CIGAR string from the tag where it expects the mismatch count. Can you try running the command below?

    gatk PathSeqBwaSpark  \
       --paired-input sorted_unaligned_S5000_3D.bam \
       --paired-output output_reads_paired.bam \
       --microbe-bwa-image /hpc/refdata/pathseq/microbe/pathseq_host.fa.img \
       --microbe-dict /hpc/refdata/pathseq/microbe/pathseq_microbe.dict

    This command produces the BAM file that PathSeqPipelineSpark goes on to analyze.

    The main task is then to check the XA tags of your reads.

    Please check your BAM output with the

    gatk FilterSamReads

    tool and the script below:

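    // Print every read that has an XA entry with more than four comma-separated fields.
    // A well-formed BWA XA entry has exactly four: chr,pos,CIGAR,NM.
    function accept(rec) {
        if (rec.hasAttribute("XA")) {
            var xastring = rec.getStringAttribute("XA");
            var xaarray = xastring.split(";"); // one element per alternative hit
            for (var i = 0; i < xaarray.length; i++) {
                var subxaarray = xaarray[i].split(",");
                if (subxaarray.length > 4) {
                    print(rec.getSAMString());
                }
            }
        }
        // Always reject: the reads of interest are printed, not written to the output BAM.
        return false;
    }

    // The value of the last expression decides whether the record is kept.
    accept(record);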

    Save this script as filter.js and run the command below.

    gatk FilterSamReads -JS filter.js -I output_reads_paired.bam -O filtered.bam --FILTER includeJavascript

    The script checks the XA tag of every read and prints the SAM string of each read whose XA tag violates the expected format. Please post some of those reads here so that we can find the actual cause. It is possibly a FASTA sequence name error somewhere in the reference or the dictionary.

    Can you help us with this one so that we can solve this issue faster?

    Regards. 
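
The XA check described in the last comment can be tried outside GATK on a few tag strings. The sketch below is plain JavaScript for Node.js and does not use Picard or GATK. It assumes BWA's XA layout of `chr,pos,CIGAR,NM;` per alternative hit. The helper names and the sample tag values are made up for illustration. The second helper only imitates a parser that reads the fourth field as a number; whether PathSeq parses the tag exactly this way is an assumption, not something confirmed in this thread. Unlike the filter.js script, which flags entries with more than four fields, this version flags any entry that does not have exactly four.

```javascript
// Sketch only: plain JavaScript (Node.js), independent of GATK and Picard.
// Assumes the BWA XA layout "chr,pos,CIGAR,NM;" repeated once per alternative hit.

// Hypothetical helper: return the XA entries that do not have exactly four fields.
function findMalformedXaEntries(xaString) {
    return xaString
        .split(";")
        // Drop the empty piece that follows the trailing ';'.
        .filter(function (entry) { return entry.length > 0; })
        .filter(function (entry) { return entry.split(",").length !== 4; });
}

// Hypothetical helper: imitate a parser that reads the fourth field as the NM value.
function parseEditDistance(entry) {
    var fields = entry.split(",");
    var nm = Number(fields[3]);
    if (!Number.isInteger(nm)) {
        throw new Error("not a number: " + fields[3]);
    }
    return nm;
}

// Made-up example values; the second contig name contains a comma.
var good = "chr1,+12345,76M,2;chr2,-500,76M,0;";
var bad = "contig,with_comma,+100,76M,1;";

console.log(findMalformedXaEntries(good)); // no malformed entries
console.log(findMalformedXaEntries(bad));  // one entry with five fields
```

With the made-up `bad` value, the extra comma shifts every field one place to the right, so the fourth field holds the CIGAR string `76M` and the numeric parse fails. That matches the kind of failure described in the thread, where a CIGAR string turns up in a position that should hold a number.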

