A USER ERROR has occurred: Duplicate sample: PD4086bv2. Sample was found in both file
REQUIRED for all errors and issues:
Hello, I'm building a panel of normals (PON) and I'm running into a problem, apparently because different files contain the same tumor sample name.
a) GATK version used: 4.2.2.0
b) Exact command used:
gatk GenomicsDBImport -R Homo_sapiens_assembly19.fasta --genomicsdb-workspace-path pon_db -V 4666_1.vcf.gz -V 4672_2.vcf.gz -V 4732_8.vcf.gz -V 4862_2.vcf.gz -V 4862_3.vcf.gz -V 4862_4.vcf.gz -V 4862_5.vcf.gz -V 4862_6.vcf.gz -V 4862_7.vcf.gz -V 4862_8.vcf.gz -V 4886_5.vcf.gz -V 4886_6.vcf.gz -V 4886_7.vcf.gz -V 4886_8.vcf.gz -V 4902_1.vcf.gz -V 4902_2.vcf.gz -V 4902_3.vcf.gz -V 4902_4.vcf.gz -V 4902_5.vcf.gz -V 4902_6.vcf.gz -V 4902_7.vcf.gz -V 4902_8.vcf.gz -V 4943_1.vcf.gz -V 4943_2.vcf.gz -V 4943_3.vcf.gz -V 4943_4.vcf.gz -V 4943_5.vcf.gz -V 4943_7.vcf.gz -V 4943_8.vcf.gz -V 4988_1.vcf.gz -V 4988_2.vcf.gz -V 4988_3.vcf.gz -V 4988_4.vcf.gz -V 5043_1.vcf.gz -V 5043_2.vcf.gz -V 5043_3.vcf.gz -V 5043_4.vcf.gz -V 5043_5.vcf.gz -V 5043_6.vcf.gz -V 5043_7.vcf.gz -V 5043_8.vcf.gz -V 5051_1.vcf.gz -V 5051_2.vcf.gz -V 5051_3.vcf.gz -V 5051_4.vcf.gz -V 5051_5.vcf.gz -V 5051_6.vcf.gz -V 5051_7.vcf.gz -V 5051_8.vcf.gz -V 5090_1.vcf.gz -V 5090_2.vcf.gz -V 5090_3.vcf.gz -V 5090_4.vcf.gz -V 5090_5.vcf.gz -V 5090_6.vcf.gz -V 5090_7.vcf.gz -V 5090_8.vcf.gz -V 5105_1.vcf.gz -V 5105_2.vcf.gz -V 5105_3.vcf.gz -V 5105_4.vcf.gz -V 5105_5.vcf.gz -V 5105_6.vcf.gz -V 5105_7.vcf.gz -V 5105_8.vcf.gz
c) Entire program log:
A USER ERROR has occurred: Duplicate sample: PD4086bv2. Sample was found in both file:///home/adrianib/gatk-4.2.2.0/Genome_Analyzer_II/4862_2.vcf.gz and 4672_2.vcf.gz.
When I saw this error, I decompressed the file, opened it, and changed the name "PD4086bv2" in one of the files (4672_2.vcf). After that, I saved the changes and compressed the file again.
Then I ran the command from b) again, but I got a different error:
A USER ERROR has occurred: Failed to create reader from file:///home/adrianib/gatk-4.2.2.0/Genome_Analyzer_II/4672_2.vcf.gz because of the following error:
Unable to parse header with error: Invalid GZIP header, for input source: file:///home/adrianib/gatk-4.2.2.0/Genome_Analyzer_II/4672_2.vcf.gz
So I'm lost at this point. How can I fix this error and create my PON?
Thank you in advance.
-
Hi jesus ix ballote,
It looks like you either had an issue with the zipping and unzipping process, or you did not correctly change the sample name in the file.
You can unzip and zip the file back up, or look into what might have gone wrong with the sample renaming.
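One way to check the compression side first is to test whether each file still decompresses cleanly. The sketch below is only an illustration: the function names `is_valid_gzip` and `check_vcf_gz` are made up for this example, and `gzip -t` only confirms that the file is a readable gzip stream. It does not confirm that the file was block-compressed with bgzip or that its index is up to date.

```shell
#!/bin/sh
# Hypothetical helpers (names are not part of GATK or htslib).

# Return 0 if the file decompresses cleanly as gzip, non-zero otherwise.
is_valid_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Print one status line per file so a broken archive is easy to spot.
check_vcf_gz() {
    for f in "$@"; do
        if is_valid_gzip "$f"; then
            echo "OK   $f"
        else
            echo "BAD  $f"
        fi
    done
}

# Example use (file name taken from the error message above):
#   check_vcf_gz 4672_2.vcf.gz
```

A file reported as BAD would need to be compressed again from the plain-text VCF.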
Let me know what solves the problem!
Genevieve
-
Hello Genevieve Brandt (she/her), thanks for answering.
I am using the gzip command to zip and unzip the files. Is this the correct command, or which one should I use?
-
Yes, that command works!
-
I'm trying two possible arguments to solve the error.
One is: --disable-read-filter NotDuplicateReadFilter
gatk GenomicsDBImport \
-R Homo_sapiens_assembly19.fasta \
--disable-read-filter NotDuplicateReadFilter \
--genomicsdb-workspace-path pon_db \
-V 4666_1.vcf.gz \
-V 4672_2.vcf.gz \
-V 4732_8.vcf.gz \
-V 4862_2.vcf.gz \
-V 4862_3.vcf.gz \
-V 4862_4.vcf.gz \
-V 4862_5.vcf.gz \
-V 4862_6.vcf.gz \
-V 4862_7.vcf.gz \
-V 4862_8.vcf.gz \
-V 4886_5.vcf.gz \
-V 4886_6.vcf.gz \
-V 4886_7.vcf.gz \
-V 4886_8.vcf.gz \
-V 4902_1.vcf.gz \
-V 4902_2.vcf.gz \
-V 4902_3.vcf.gz \
-V 4902_4.vcf.gz \
-V 4902_5.vcf.gz \
-V 4902_6.vcf.gz \
-V 4902_7.vcf.gz \
-V 4902_8.vcf.gz \
-V 4943_1.vcf.gz \
-V 4943_2.vcf.gz \
-V 4943_3.vcf.gz \
-V 4943_4.vcf.gz \
-V 4943_5.vcf.gz \
-V 4943_7.vcf.gz \
-V 4943_8.vcf.gz \
-V 4988_1.vcf.gz \
-V 4988_2.vcf.gz \
-V 4988_3.vcf.gz \
-V 4988_4.vcf.gz \
-V 5043_1.vcf.gz \
-V 5043_2.vcf.gz \
-V 5043_3.vcf.gz \
-V 5043_4.vcf.gz \
-V 5043_5.vcf.gz \
-V 5043_6.vcf.gz \
-V 5043_7.vcf.gz \
-V 5043_8.vcf.gz \
-V 5051_1.vcf.gz \
-V 5051_2.vcf.gz \
-V 5051_3.vcf.gz \
-V 5051_4.vcf.gz \
-V 5051_5.vcf.gz \
-V 5051_6.vcf.gz \
-V 5051_7.vcf.gz \
-V 5051_8.vcf.gz \
-V 5090_1.vcf.gz \
-V 5090_2.vcf.gz \
-V 5090_3.vcf.gz \
-V 5090_4.vcf.gz \
-V 5090_5.vcf.gz \
-V 5090_6.vcf.gz \
-V 5090_7.vcf.gz \
-V 5090_8.vcf.gz \
-V 5105_1.vcf.gz \
-V 5105_2.vcf.gz \
-V 5105_3.vcf.gz \
-V 5105_4.vcf.gz \
-V 5105_5.vcf.gz \
-V 5105_6.vcf.gz \
-V 5105_7.vcf.gz \
-V 5105_8.vcf.gz
And the other one is: --read-filter AllowAllReadsReadFilter
gatk GenomicsDBImport \
-R Homo_sapiens_assembly19.fasta \
--read-filter AllowAllReadsReadFilter \
--genomicsdb-workspace-path pon_db \
-V 4666_1.vcf.gz \
-V 4672_2.vcf.gz \
-V 4732_8.vcf.gz \
-V 4862_2.vcf.gz \
-V 4862_3.vcf.gz \
-V 4862_4.vcf.gz \
-V 4862_5.vcf.gz \
-V 4862_6.vcf.gz \
-V 4862_7.vcf.gz \
-V 4862_8.vcf.gz \
-V 4886_5.vcf.gz \
-V 4886_6.vcf.gz \
-V 4886_7.vcf.gz \
-V 4886_8.vcf.gz \
-V 4902_1.vcf.gz \
-V 4902_2.vcf.gz \
-V 4902_3.vcf.gz \
-V 4902_4.vcf.gz \
-V 4902_5.vcf.gz \
-V 4902_6.vcf.gz \
-V 4902_7.vcf.gz \
-V 4902_8.vcf.gz \
-V 4943_1.vcf.gz \
-V 4943_2.vcf.gz \
-V 4943_3.vcf.gz \
-V 4943_4.vcf.gz \
-V 4943_5.vcf.gz \
-V 4943_7.vcf.gz \
-V 4943_8.vcf.gz \
-V 4988_1.vcf.gz \
-V 4988_2.vcf.gz \
-V 4988_3.vcf.gz \
-V 4988_4.vcf.gz \
-V 5043_1.vcf.gz \
-V 5043_2.vcf.gz \
-V 5043_3.vcf.gz \
-V 5043_4.vcf.gz \
-V 5043_5.vcf.gz \
-V 5043_6.vcf.gz \
-V 5043_7.vcf.gz \
-V 5043_8.vcf.gz \
-V 5051_1.vcf.gz \
-V 5051_2.vcf.gz \
-V 5051_3.vcf.gz \
-V 5051_4.vcf.gz \
-V 5051_5.vcf.gz \
-V 5051_6.vcf.gz \
-V 5051_7.vcf.gz \
-V 5051_8.vcf.gz \
-V 5090_1.vcf.gz \
-V 5090_2.vcf.gz \
-V 5090_3.vcf.gz \
-V 5090_4.vcf.gz \
-V 5090_5.vcf.gz \
-V 5090_6.vcf.gz \
-V 5090_7.vcf.gz \
-V 5090_8.vcf.gz \
-V 5105_1.vcf.gz \
-V 5105_2.vcf.gz \
-V 5105_3.vcf.gz \
-V 5105_4.vcf.gz \
-V 5105_5.vcf.gz \
-V 5105_6.vcf.gz \
-V 5105_7.vcf.gz \
-V 5105_8.vcf.gz
But in both cases I get the same error as at the beginning:
A USER ERROR has occurred: Duplicate sample: PD4086bv2. Sample was found in both file:///home/adrianib/gatk-4.2.2.0/Genome_Analyzer_II/4862_2.vcf.gz and 4672_2.vcf.gz.
I took those arguments from here:
https://gatk.broadinstitute.org/hc/en-us/articles/4405443657499-Mutect2
It's as if the program is ignoring those two arguments.
Is there a specific order in which these two arguments must be written, or can they go anywhere in the command?
-
Hi jesus ix ballote,
I would not recommend that you solve the problem with those arguments. Instead, I would recommend that you fix the error in your files, which is that several of your files have the same sample name.
It looks like the files from the error message don't have a proper sample name; instead, the sample name is just "Sample". You should change the sample names in these files so that they are all unique and match the sample names in the file.
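To see which files share a sample name before running GenomicsDBImport, the sample columns of each `#CHROM` header line can be listed and compared. This is a rough sketch using only standard tools; `list_vcf_samples` and `find_duplicate_samples` are names invented for this example, and it assumes the VCFs are tab-delimited and gzip- or bgzip-compressed.

```shell
#!/bin/sh
# Hypothetical helpers (names are not part of GATK or htslib).

# Print "sample<TAB>file" for every sample column in each VCF's #CHROM line.
list_vcf_samples() {
    for f in "$@"; do
        # gzip -dc can also read bgzip output, which is a series of gzip members.
        gzip -dc "$f" | awk -v file="$f" -F '\t' '
            /^#CHROM/ { for (i = 10; i <= NF; i++) print $i "\t" file; exit }'
    done
}

# Print only the sample names that occur more than once across the inputs.
find_duplicate_samples() {
    list_vcf_samples "$@" | cut -f1 | sort | uniq -d
}

# Example use:
#   find_duplicate_samples *.vcf.gz
#   list_vcf_samples *.vcf.gz | sort
```

The second command in the usage comment shows which files each name came from, which helps decide which ones to rename.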
Best,
Genevieve
-
Hi, Genevieve Brandt (she/her)
You were right, I had to change the patient name in all the files. The error occurs because different files come from the same patient and therefore carry the same patient name.
Once the patient name is changed so that no two files share the same name, the error is no longer raised.
I did the following steps:
1. Decompress the file with the bgzip command:
bgzip -d file.vcf.gz
2. Open it
3. Change the sample name in these two lines of each file (Patient_1 in this example):
##tumor_sample=Patient_1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_1
4. Save the changes
5. Compress the file
bgzip file.vcf
6. Finally, it is important to re-index the file. I did it with this command:
tabix -p vcf file.vcf.gz
My mistake was not re-indexing the file. Once I did, GATK worked!
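For anyone with many files, the steps above could probably be scripted instead of edited by hand. The following is a minimal sketch, not a tested pipeline: `rename_sample_stream` and `rename_sample_vcf` are names invented here, the wrapper assumes `bgzip` and `tabix` are on the PATH, and it assumes the only places the sample name must change are the `##tumor_sample=` line and the `#CHROM` header line, as in the steps above.

```shell
#!/bin/sh
# Hypothetical helpers (names are not part of GATK or htslib).

# Rewrite the sample name in a VCF text stream (stdin -> stdout).
# Only the ##tumor_sample= line and the #CHROM sample columns are changed.
rename_sample_stream() {
    old=$1
    new=$2
    awk -v old="$old" -v new="$new" 'BEGIN { FS = OFS = "\t" }
        $0 == "##tumor_sample=" old { print "##tumor_sample=" new; next }
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) if ($i == old) $i = new
            print; next
        }
        { print }'
}

# Decompress, rename, recompress with bgzip, then re-index with tabix.
# Writes to a new file so the original stays untouched.
rename_sample_vcf() {
    src=$1
    dst=$2
    bgzip -dc "$src" | rename_sample_stream "$3" "$4" | bgzip -c > "$dst" &&
        tabix -p vcf "$dst"
}

# Example use (output file name and new sample name are placeholders):
#   rename_sample_vcf 4672_2.vcf.gz 4672_2.renamed.vcf.gz PD4086bv2 Patient_1
```

Writing to a separate output file keeps the original VCF available in case the rename needs to be repeated with a different name.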
Thanks for your advice!
-
Thank you for posting your solution jesus ix ballote!