Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GenomicsDBImport error: USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file:

11 comments

  • Genevieve Brandt (she/her)

    Hi Yuna Son, you may have a problem with your read groups: it looks like different samples have all been named "Sample". This does not work for GATK because each sample needs its own distinct name. Please see this document on read groups: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups

  • Yuna Son

    Hi Genevieve Brandt,

    Thank you for your reply. I am changing the read groups and re-running the process. I assume that I used the same sample name in the read groups and that this causes the problem, as you suggested. I will post an update if I have more issues. Thank you!

  • Genevieve Brandt (she/her)

    Hi Yuna Son, thank you for the update and glad I could help!

  • Yangyxt

    Dear Brandt,

    Can I ask why the GenomicsDBImport error is related to read group info? That information is stored in BAM files, while we only input VCF or g.vcf files to GenomicsDBImport.

  • Genevieve Brandt (she/her)

    Yangyxt, the read group in the BAM file determines the sample column in the VCF file. Please see this document for more info: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups

  • Yangyxt

    Dear Brandt,

    Thank you for your response. I understand that RG info determines the sample name in the VCF column after the FORMAT field. I have run into an issue where the RG ID is not correct, whereas the RG SM field is the sample name we want. The VCF sample column has the right sample name (from which I guess it is the SM field in the @RG line, rather than the ID field, that determines the VCF sample column label). Under this circumstance, I still receive a Duplicate Name Error when running GenomicsDBImport.

  • Genevieve Brandt (she/her)

    Hi Yangyxt, could you post your command and stack trace so we can evaluate whether you are facing the same issue?

  • jesus ix ballote

    Hello, I'm having the same issue. I decompressed the files, changed the name that is duplicated, and finally compressed the files again, but this doesn't work. Apparently there is a problem with the compression step. How can I fix this problem?

  • Genevieve Brandt (she/her)

    Hi jesus ix ballote,

    There are a few posts on this forum regarding compression issues. You can search our site for your specific error message with this Google search:

    site:https://gatk.broadinstitute.org/hc/en-us/community/posts [Enter your error message here]

    If that doesn't solve your problem, make a new post and we'll help you from there.

    Best,

    Genevieve

  • LG

    Hi Genevieve Brandt (she/her),

    This post is somewhat old, but I hope I can still get your help. I'm getting the same error when running GenomicsDBImport before CreateSomaticPanelOfNormals. The inputs are VCF files produced with Mutect2 from a series of normal-tissue BAM files. Before Mutect2, I used AddOrReplaceReadGroups to assign a specific RGID to each BAM, and I confirmed that they do have different IDs using samtools view:

    samtools view -H Example1.bam | grep '^@RG' :

    @RG     ID:Example1       LB:lib1 PL:ILLUMINA     SM:2    PU:unit1 

    samtools view -H Example2.bam | grep '^@RG' :

    @RG     ID:Example2       LB:lib1 PL:ILLUMINA     SM:2    PU:unit1

    What would be the reason to still get the error "USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file"? How could I solve the problem?

    Thank you very much in advance for any help.

    Best,

    Lina.

  • Genevieve Brandt (she/her)

    LG, in the example you gave, both files have the same SM (sample name) value: SM:2. Could you clarify why you have two BAM files with the same sample? If they are from the same sample, you should merge these BAM files before the Mutect2 step.
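
The check discussed in this thread (distinct read group IDs do not help if the SM values collide) can be sketched in a few lines of Python. This is only an illustrative sketch, not a GATK tool: the function names `rg_sample_names`, `vcf_sample_names`, and `duplicated_samples` are hypothetical, and it assumes you already have the header text as a string (for example, from the `samtools view -H` command shown above, or from the header lines of a VCF).

```python
from collections import defaultdict


def rg_sample_names(header_text):
    """Return the set of SM values found on @RG lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                names.add(field[3:].strip())
    return names


def vcf_sample_names(header_text):
    """Return the sample columns (everything after FORMAT) of a VCF header."""
    for line in header_text.splitlines():
        if line.startswith("#CHROM"):
            # The first nine columns are fixed: #CHROM ... INFO, FORMAT.
            return line.split("\t")[9:]
    return []


def duplicated_samples(samples_by_file):
    """Map each sample name seen in more than one file to those files."""
    files_by_sample = defaultdict(list)
    for filename, samples in samples_by_file.items():
        for sample in set(samples):
            files_by_sample[sample].append(filename)
    return {s: sorted(f) for s, f in files_by_sample.items() if len(f) > 1}


# @RG lines as posted in the thread (tab-separated, as in real headers).
example1 = "@RG\tID:Example1\tLB:lib1\tPL:ILLUMINA\tSM:2\tPU:unit1"
example2 = "@RG\tID:Example2\tLB:lib1\tPL:ILLUMINA\tSM:2\tPU:unit1"

by_file = {
    "Example1.bam": rg_sample_names(example1),
    "Example2.bam": rg_sample_names(example2),
}
print(duplicated_samples(by_file))
# prints {'2': ['Example1.bam', 'Example2.bam']}
```

The same `duplicated_samples` helper can be fed the output of `vcf_sample_names` for each input VCF, which is closer to what GenomicsDBImport actually sees. Any sample name it reports for more than one file is a candidate for the "Duplicate sample" error; whether the right fix is renaming the samples or merging the BAM files depends on whether the files really come from the same sample, as discussed above.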
