GenomicsDBImport error: USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file
(REQUIRED) Please provide:
a) GATK version used: 188.8.131.52
b) Exact command used:
/gpfs/research/medicine/sequencer/NovaSeq/Outputs_fastq/2020_Outputs/Akash_Gunjan_05-19-2020_Yuna-samples/combined_fastq/com_fq/GATK/gatk-184.108.40.206/gatk GenomicsDBImport \
-R /gpfs/research/medicine/sequencer/NovaSeq/Outputs_fastq/2020_Outputs/Akash_Gunjan_05-19-2020_Yuna-samples/results/STAR/References/GRCh38.primary_assembly.genome.fa \
-L /gpfs/research/medicine/sequencer/NovaSeq/Outputs_fastq/2020_Outputs/Akash_Gunjan_05-19-2020_Yuna-samples/GATK_Bamfiles/Preprocessed_Bam/Ref/hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list \
--genomicsdb-workspace-path pon_db \
-V HDF1_TA1_S4_normal.vcf.gz \
-V HDF1_TA2_S5_normal.vcf.gz \
-V HDF1_TA3_S6_normal.vcf.gz \
-V HDF1_UT1_S1_normal.vcf.gz \
-V HDF1_UT2_S2_normal.vcf.gz \
-V HDF1_UT3_S3_normal.vcf.gz \
-V HDF3_TA1_S10_normal.vcf.gz \
-V HDF3_TA2_S11_normal.vcf.gz \
-V HDF3_TA3_S12_normal.vcf.gz \
-V HDF3_UT1_S7_normal.vcf.gz \
-V HDF3_UT2_S8_normal.vcf.gz \
-V HDF3_UT3_S9_normal.vcf.gz \
-V HDFPA_TA1_S16_normal.vcf.gz \
-V HDFPA_TA2_S17_normal.vcf.gz \
-V HDFPA_TA3_S18_normal.vcf.gz \
-V HDFPA_UT1_S13_normal.vcf.gz \
-V HDFPA_UT2_S14_normal.vcf.gz \
c) Entire error log:
A USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file:///gpfs/research/medicine/sequencer/NovaSeq/Outputs_fastq/2020_Outputs/Akash_Gunjan_05-19-2020_Yuna-samples/GATK_Bamfiles/Preprocessed_Bam/PON/HDF1_TA2_S5_normal.vcf.gz and HDF1_TA1_S4_normal.vcf.gz.
Choose a category for your question:
Why do I see this error message?
Hi Yuna Son, you may have a problem with your read groups, and have named different samples as "Sample". This does not work for GATK because different samples need to be named accordingly. Please see this document on read groups: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
Hi Genevieve Brandt,
Thank you for your reply. I am changing the read groups and re-running the process. I assume I used the same sample name in the read groups, and that this causes the problem, as you suggested. I will update if I have more issues. Thank you!
Hi Yuna Son, thank you for the update and glad I could help!
Can I ask why the GenomicsDBImport error is related to read group info? That information is stored in BAM files, while we only input VCF or g.vcf files to GenomicsDBImport.
Yangyxt, the read group in the BAM file sets the sample column in the VCF file: the sample name (SM) from the read group becomes the name of the sample column. Please see this for more info: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
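Since GenomicsDBImport only sees the VCFs, it can help to check the sample names in the VCF headers directly. Here is a rough sketch, not an official GATK utility: the function names `vcf_sample_names` and `find_duplicate_samples` are made up for illustration, and it assumes the inputs are gzip- or bgzip-compressed VCFs and that `gzip` and `awk` are available.

```shell
# Print the sample name(s) declared in a VCF header: the columns after FORMAT
# (field 10 onward) on the #CHROM line. Input is a gzip/bgzip-compressed VCF.
vcf_sample_names() {
    gzip -dc "$1" | awk -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Print every sample name declared by more than one of the given VCFs,
# followed by the files that declare it.
find_duplicate_samples() {
    for vcf in "$@"; do
        vcf_sample_names "$vcf" | while IFS= read -r name; do
            printf '%s\t%s\n' "$name" "$vcf"
        done
    done | awk -F'\t' '
        { count[$1]++; files[$1] = files[$1] " " $2 }
        END { for (s in count) if (count[s] > 1) print s ":" files[s] }'
}
```

Running `find_duplicate_samples *_normal.vcf.gz` in the folder with the inputs should list any sample name shared by two or more files. An empty result means the header names are all distinct.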
Thank you for your response. I understand that the RG info determines the sample name in the VCF column after the FORMAT field. I have run into an issue where the RG ID is not correct, whereas the RG SM field is the sample name we want. The VCF sample column has the right sample name (from which I guess it is the SM field in the @RG line, rather than the ID field, that determines the VCF sample column label). Even so, I still receive a Duplicate Name Error when running GenomicsDBImport.
Hi Yangyxt could you post your command and stack trace so we can evaluate if you are facing the same issue?
Hello, I'm having the same issue. I decompressed the files, changed the duplicated name, and compressed the files again, but this doesn't work. Apparently there is a problem with the compression step. How can I fix this problem?
Hi jesus ix ballote,
There are a few posts on this forum regarding compression issues. You can search our site for your specific error message with this Google search:
site:https://gatk.broadinstitute.org/hc/en-us/community/posts [Enter your error message here]
If that doesn't solve your problem, make a new post and we'll help you from there.
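For the renaming step itself, here is a minimal sketch that rewrites only the sample column on the #CHROM header line. The function name `rename_vcf_sample` is hypothetical, and it assumes an uncompressed VCF on stdin and `awk` on the path. One possible cause of the compression problem (an assumption, since no error message was posted) is recompressing with plain gzip; GATK expects block-compressed (bgzip) VCFs with an index.

```shell
# Read an uncompressed VCF on stdin and write it to stdout with the sample
# column named OLD ($1) renamed to NEW ($2) on the #CHROM header line.
# Every other line passes through unchanged.
rename_vcf_sample() {
    awk -F'\t' -v OFS='\t' -v old="$1" -v new="$2" '
        /^#CHROM/ {
            # Appending "" forces a string comparison, so "01" does not match "1".
            for (i = 10; i <= NF; i++) if (($i "") == (old "")) $i = new
        }
        { print }'
}
```

Assuming `bgzip` and `tabix` from htslib are installed, usage could look like `gzip -dc in.vcf.gz | rename_vcf_sample 1 HDF1_TA1_S4 | bgzip > renamed.vcf.gz` followed by `tabix -p vcf renamed.vcf.gz` (file and sample names here are only examples).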
Hi Genevieve Brandt (she/her),
This post is a bit old, but I hope I can still get your help. I'm having the same error when running GenomicsDBImport before CreateSomaticPanelOfNormals. The inputs are VCF files produced with Mutect2 from a series of normal-tissue BAM files. Before Mutect2, I used AddOrReplaceReadGroups to assign a specific RGID to each BAM, and I confirmed with samtools view that they do have different IDs:
samtools view -H Example1.bam | grep '^@RG' :
@RG ID:Example1 LB:lib1 PL:ILLUMINA SM:2 PU:unit1
samtools view -H Example2.bam | grep '^@RG' :
@RG ID:Example2 LB:lib1 PL:ILLUMINA SM:2 PU:unit1
What would be the reason to still get the error "USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file"? How could I solve the problem?
Thank you very much in advance for any help.
LG, in the example you gave, both read groups have the same SM (sample name): SM = 2. Could you clarify why you have two BAM files with the same sample? If they are from the same sample, you should merge these BAM files before the Mutect2 step.
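To make the ID versus SM difference easy to spot across many BAMs, here is a small sketch that prints both fields side by side. The function name `rg_id_and_sample` is made up for illustration; it assumes tab-separated SAM header text on stdin (as `samtools view -H` prints it) and `awk` on the path.

```shell
# Read SAM header text on stdin and print "ID<TAB>SM" for each @RG line.
rg_id_and_sample() {
    awk -F'\t' '
        /^@RG/ {
            id = ""; sm = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            print id "\t" sm
        }'
}
```

For example, `samtools view -H Example1.bam | rg_id_and_sample` would print `Example1` and `2` for the header shown above. If the BAMs really are different samples, re-running AddOrReplaceReadGroups with a distinct `--RGSM` value per BAM, and then re-running Mutect2, should give each VCF its own sample name.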
It has been a while since we've heard from you, so we'll be closing out this ticket in our system. However, please note that if you still require assistance, you need only respond to this thread, and we can swiftly re-open your ticket and pick up where we left off.