Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

emit-ref-confidence error using single sample BAM



  • Official comment
    Genevieve Brandt (she/her)

    suzy_bunters & Alia Parveen,

    Read groups are required to run GATK; you may just need to add them to your file. Usually read groups are added during alignment, but you can add them to an existing BAM with AddOrReplaceReadGroups. This document has more information about how to do that:

    Hope this helps!



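As a minimal sketch of the AddOrReplaceReadGroups route mentioned above, the function below prints a command for a BAM that has no @RG lines. The function name, the file names, and every read group value (ID, LB, PU, SM) are placeholder assumptions for illustration; real values should come from your sequencing metadata.

```shell
# Sketch: print an AddOrReplaceReadGroups command for review before running it.
# add_read_groups_cmd is a hypothetical helper; the .rg1/.lib1/.unit1 suffixes
# are placeholders, not values taken from this thread.
add_read_groups_cmd() {
    in_bam=$1; out_bam=$2; sample=$3
    printf 'gatk AddOrReplaceReadGroups -I %s -O %s --RGID %s --RGLB %s --RGPL ILLUMINA --RGPU %s --RGSM %s\n' \
        "$in_bam" "$out_bam" "$sample.rg1" "$sample.lib1" "$sample.unit1" "$sample"
}

# Print the command for a hypothetical input; copy it or pipe it to `sh` to run it.
add_read_groups_cmd input.bam output.rg.bam sample1
```

The rewritten BAM needs a fresh index (for example with samtools index) before it is passed to HaplotypeCaller.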
  • Avatar
    Genevieve Brandt (she/her)

    Hi suzy_bunters, yes we will get to this as soon as we are able. Please see our support policy. I think this problem has been addressed on the forum before, so I would recommend going through other forum posts if you want a solution ASAP. 

  • Avatar

    Hi there, is anyone able to help on this? 


  • Avatar

    Hi Genevieve, thanks for your reply (and sorry to nag) :D I checked the forum before posting but the other solutions I found don't work for me. I'll be more patient though!

  • Avatar
    Genevieve Brandt (she/her)

    Hi suzy_bunters, could you print out the read group lines in the BAM header? You can see this doc for more information:

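One way to print the read group lines, assuming samtools is installed, is to dump the header with samtools view -H and keep only the lines that start with @RG. The helper name rg_lines and the sample header below are made up for illustration.

```shell
# Print only the @RG lines from SAM header text read on stdin.
rg_lines() {
    # grep exits non-zero when nothing matches; treat that as "no read groups"
    grep '^@RG' || true
}

# Real use (requires samtools): samtools view -H your.bam | rg_lines
# Self-contained demonstration on a made-up header:
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA\n' | rg_lines
```

If nothing is printed, the header has no read groups and GATK cannot tell which sample the reads belong to.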
  • Avatar

    Hi Genevieve Brandt (she/her),

    There are no @RG lines in the header (the headers either start with @HD, @SQ, or @PG).

    The original fasta files from which the BAM was generated were trimmed using Trimmomatic. Would that have removed the read group lines?

  • Avatar
    Alia Parveen

    I also encountered the same error and my sorted bam files do not have @RG lines.

    I was given trimmed fastq files (2 years old) that I aligned with the latest reference genome using BWA and generated .sam, .bam, and sorted.bam files using Samtools. First time using GATK for gVCF and VCF generation.

    bwa mem susScr11.fasta SV11_R1.fastq.gz SV11_R2.fastq.gz > SV11.sam
    samtools view -S -b SV11.sam > SV11.bam

    samtools sort SV11.bam -o SV11.s.bam
    samtools index SV11.s.bam
    java -jar gatk-package- HaplotypeCaller -R susScr11.fasta -I SV11.s.bam -O SV11.g.vcf.gz -ERC GVCF


    A USER ERROR has occurred: Argument emit-ref-confidence has a bad value: Can only be used in single sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file.

    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

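The bwa mem command above has no -R option, so the resulting BAM carries no @RG line and therefore no sample name; that is presumably why HaplotypeCaller does not find exactly one sample and rejects -ERC GVCF. A sketch of adding the read group at alignment time follows. The helper name rg_header and the ID, SM, and LB values are assumptions for illustration, reusing the SV11 file names from the post.

```shell
# Build the read group string that bwa mem expects after -R.
# bwa converts the literal two-character sequence \t into a TAB itself,
# so the string must contain backslash-t, not real tabs.
rg_header() {
    # '\\t' in the printf format prints a literal backslash followed by t
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\n' "$1" "$2" "$3"
}

# Placeholder values: ID, sample name, and library all set to SV11.
RG=$(rg_header SV11 SV11 SV11)

# Print the alignment command for review (printf avoids echo's escape handling).
printf '%s\n' "bwa mem -R '$RG' susScr11.fasta SV11_R1.fastq.gz SV11_R2.fastq.gz | samtools sort -o SV11.s.bam -"
```

Adding the read group during alignment avoids a separate AddOrReplaceReadGroups pass over the finished BAM.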
  • Avatar
    Alia Parveen

    Thank you, it worked. I have one more question. When I view my final VCF (after genotype calling) in IGV, I see far too many SNPs. Some SNPs appear in only one sample of either the case or the control group, and I would like to remove them. Is there a GATK function to remove such SNPs, or do I have to write my own code for it? Thank you.

  • Avatar
    Genevieve Brandt (she/her)

    Hi Alia Parveen, glad to hear that it worked!

    For the next question, make sure you are filtering your variants. You can check out our best practices here. You can also search through the forum for other users with a similar question to see how they refined their variants. If you are not able to figure out a solution, make a new post on the forum, since it is a different question.




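As a rough illustration of what a filtering step can look like, the sketch below prints a VariantFiltration command with a few hard filters. The helper name, the file names, and the filter names are hypothetical, and the thresholds are generic starting points assumed here, not values recommended in this thread or tuned for this dataset.

```shell
# Sketch: assemble a hard-filtering command for review before running it.
# build_filter_cmd is a hypothetical helper; thresholds are untuned placeholders.
build_filter_cmd() {
    in_vcf=$1; out_vcf=$2
    cmd="gatk VariantFiltration -V $in_vcf -O $out_vcf"
    cmd="$cmd --filter-expression 'QD < 2.0' --filter-name QD2"
    cmd="$cmd --filter-expression 'FS > 60.0' --filter-name FS60"
    cmd="$cmd --filter-expression 'MQ < 40.0' --filter-name MQ40"
    printf '%s\n' "$cmd"
}

# Hypothetical input and output names.
build_filter_cmd cohort.snps.vcf.gz cohort.snps.filtered.vcf.gz
```

VariantFiltration only annotates the FILTER column; failing records stay in the file until they are dropped in a later step, for example with SelectVariants and its --exclude-filtered option.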
  • Avatar
    Spencer Monckton

    I hate to revive this answered question, but I'm running into this same error now, and I'm finding the instructions for adding read group information to be ambiguous. The documentation about read groups states: "When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group," and "In Illumina data, read group IDs are composed using the flowcell name and lane number". My data comes from 70 libraries/samples that were pooled and run on one Illumina flowcell, in a single lane. So which is it? According to the first statement, each library/sample should have a unique ID, but according to the second statement, they should all have the same ID field.

    How do HaplotypeCaller and other downstream tools differentiate between samples/libraries after VCF files are eventually merged? Should I give each of my 70 samples a unique ID, LB, or SM field, or some combination of the above? Genevieve Brandt (she/her) are you able to clarify?

  • Avatar
    Genevieve Brandt (she/her)

    Spencer Monckton, I think that in your case each library/sample should get a unique identifier (ID). Different samples need to be separated, and the different libraries should be separated as well.

    Samples will be identified by the SM sample name. 

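One possible way to apply this advice to many libraries pooled on a single lane is sketched below: each sample gets its own ID, SM, and LB, while the flowcell and lane are shared. The naming scheme, the helper name make_rg, and the FLOWCELL, LANE, and sample values are all assumptions for illustration; in real data the last PU component would normally be the sample barcode.

```shell
# Print one @RG header line per sample for libraries pooled on one lane.
# FLOWCELL, LANE, and the sample names are made-up placeholders.
FLOWCELL=HXXXXXXXX
LANE=1

make_rg() {
    sample=$1
    # ID is unique per read group, SM names the sample, LB names the library,
    # and PU combines flowcell, lane, and sample so each unit is distinguishable.
    printf '@RG\tID:%s.%s.%s\tSM:%s\tLB:%s_lib1\tPL:ILLUMINA\tPU:%s.%s.%s\n' \
        "$FLOWCELL" "$LANE" "$sample" "$sample" "$sample" "$FLOWCELL" "$LANE" "$sample"
}

for s in sample01 sample02 sample03; do
    make_rg "$s"
done
```

Because SM differs for every sample, genotype columns stay separate per sample once the per-sample GVCFs are combined.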
