Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

A USER ERROR has occurred: Duplicate sample: PD4086bv2. Sample was found in both file

Answered

7 comments

  • Genevieve Brandt (she/her)

    Hi jesus ix ballote,

    It looks like you either had an issue with the zipping and unzipping process, or you did not correctly change the sample name in the file.

    You can unzip the file and zip it back up, or look into what might have gone wrong with the sample renaming.

    Let me know what solves the problem!

    Genevieve

  • jesus ix ballote

    Hello Genevieve-Brandt-she-her, thanks for answering.

    I am using the gzip command to zip and unzip the files. Is this command correct, or should I use a different one?

  • Genevieve Brandt (she/her)

    Yes, that command works!

  • jesus ix ballote

    Hi Genevieve-Brandt-she-her

    I'm trying two possible arguments to resolve the error.

    One is: --disable-read-filter NotDuplicateReadFilter

    gatk GenomicsDBImport \
    -R Homo_sapiens_assembly19.fasta \
    --disable-read-filter NotDuplicateReadFilter \
    --genomicsdb-workspace-path pon_db \
    -V 4666_1.vcf.gz \
    -V 4672_2.vcf.gz \
    -V 4732_8.vcf.gz \
    -V 4862_2.vcf.gz \
    -V 4862_3.vcf.gz \
    -V 4862_4.vcf.gz \
    -V 4862_5.vcf.gz \
    -V 4862_6.vcf.gz \
    -V 4862_7.vcf.gz \
    -V 4862_8.vcf.gz \
    -V 4886_5.vcf.gz \
    -V 4886_6.vcf.gz \
    -V 4886_7.vcf.gz \
    -V 4886_8.vcf.gz \
    -V 4902_1.vcf.gz \
    -V 4902_2.vcf.gz \
    -V 4902_3.vcf.gz \
    -V 4902_4.vcf.gz \
    -V 4902_5.vcf.gz \
    -V 4902_6.vcf.gz \
    -V 4902_7.vcf.gz \
    -V 4902_8.vcf.gz \
    -V 4943_1.vcf.gz \
    -V 4943_2.vcf.gz \
    -V 4943_3.vcf.gz \
    -V 4943_4.vcf.gz \
    -V 4943_5.vcf.gz \
    -V 4943_7.vcf.gz \
    -V 4943_8.vcf.gz \
    -V 4988_1.vcf.gz \
    -V 4988_2.vcf.gz \
    -V 4988_3.vcf.gz \
    -V 4988_4.vcf.gz \
    -V 5043_1.vcf.gz \
    -V 5043_2.vcf.gz \
    -V 5043_3.vcf.gz \
    -V 5043_4.vcf.gz \
    -V 5043_5.vcf.gz \
    -V 5043_6.vcf.gz \
    -V 5043_7.vcf.gz \
    -V 5043_8.vcf.gz \
    -V 5051_1.vcf.gz \
    -V 5051_2.vcf.gz \
    -V 5051_3.vcf.gz \
    -V 5051_4.vcf.gz \
    -V 5051_5.vcf.gz \
    -V 5051_6.vcf.gz \
    -V 5051_7.vcf.gz \
    -V 5051_8.vcf.gz \
    -V 5090_1.vcf.gz \
    -V 5090_2.vcf.gz \
    -V 5090_3.vcf.gz \
    -V 5090_4.vcf.gz \
    -V 5090_5.vcf.gz \
    -V 5090_6.vcf.gz \
    -V 5090_7.vcf.gz \
    -V 5090_8.vcf.gz \
    -V 5105_1.vcf.gz \
    -V 5105_2.vcf.gz \
    -V 5105_3.vcf.gz \
    -V 5105_4.vcf.gz \
    -V 5105_5.vcf.gz \
    -V 5105_6.vcf.gz \
    -V 5105_7.vcf.gz \
    -V 5105_8.vcf.gz

    And the other one is: --read-filter AllowAllReadsReadFilter

    gatk GenomicsDBImport \
    -R Homo_sapiens_assembly19.fasta \
    --read-filter AllowAllReadsReadFilter \
    --genomicsdb-workspace-path pon_db \
    -V 4666_1.vcf.gz \
    -V 4672_2.vcf.gz \
    -V 4732_8.vcf.gz \
    -V 4862_2.vcf.gz \
    -V 4862_3.vcf.gz \
    -V 4862_4.vcf.gz \
    -V 4862_5.vcf.gz \
    -V 4862_6.vcf.gz \
    -V 4862_7.vcf.gz \
    -V 4862_8.vcf.gz \
    -V 4886_5.vcf.gz \
    -V 4886_6.vcf.gz \
    -V 4886_7.vcf.gz \
    -V 4886_8.vcf.gz \
    -V 4902_1.vcf.gz \
    -V 4902_2.vcf.gz \
    -V 4902_3.vcf.gz \
    -V 4902_4.vcf.gz \
    -V 4902_5.vcf.gz \
    -V 4902_6.vcf.gz \
    -V 4902_7.vcf.gz \
    -V 4902_8.vcf.gz \
    -V 4943_1.vcf.gz \
    -V 4943_2.vcf.gz \
    -V 4943_3.vcf.gz \
    -V 4943_4.vcf.gz \
    -V 4943_5.vcf.gz \
    -V 4943_7.vcf.gz \
    -V 4943_8.vcf.gz \
    -V 4988_1.vcf.gz \
    -V 4988_2.vcf.gz \
    -V 4988_3.vcf.gz \
    -V 4988_4.vcf.gz \
    -V 5043_1.vcf.gz \
    -V 5043_2.vcf.gz \
    -V 5043_3.vcf.gz \
    -V 5043_4.vcf.gz \
    -V 5043_5.vcf.gz \
    -V 5043_6.vcf.gz \
    -V 5043_7.vcf.gz \
    -V 5043_8.vcf.gz \
    -V 5051_1.vcf.gz \
    -V 5051_2.vcf.gz \
    -V 5051_3.vcf.gz \
    -V 5051_4.vcf.gz \
    -V 5051_5.vcf.gz \
    -V 5051_6.vcf.gz \
    -V 5051_7.vcf.gz \
    -V 5051_8.vcf.gz \
    -V 5090_1.vcf.gz \
    -V 5090_2.vcf.gz \
    -V 5090_3.vcf.gz \
    -V 5090_4.vcf.gz \
    -V 5090_5.vcf.gz \
    -V 5090_6.vcf.gz \
    -V 5090_7.vcf.gz \
    -V 5090_8.vcf.gz \
    -V 5105_1.vcf.gz \
    -V 5105_2.vcf.gz \
    -V 5105_3.vcf.gz \
    -V 5105_4.vcf.gz \
    -V 5105_5.vcf.gz \
    -V 5105_6.vcf.gz \
    -V 5105_7.vcf.gz \
    -V 5105_8.vcf.gz

    But in both cases I'm getting the same error as at the beginning:

    A USER ERROR has occurred: Duplicate sample: PD4086bv2. Sample was found in both file:///home/adrianib/gatk-4.2.2.0/Genome_Analyzer_II/4862_2.vcf.gz and 4672_2.vcf.gz.

    I took those arguments from here:

    https://gatk.broadinstitute.org/hc/en-us/articles/4405443657499-Mutect2

    It's as if the program is ignoring those two arguments.

    Is there a specific order in which those two arguments must be written, or can they go anywhere in the command?

  • Genevieve Brandt (she/her)

    Hi jesus ix ballote,

    I would not recommend trying to solve the problem with those arguments. Instead, I would recommend fixing the underlying error in your files, which is that several of them have the same sample name.

    It looks like the files from the error message don't have a proper sample name; instead, the sample name is just "Sample". You should change the sample names in these files so that they are all unique and match the sample names in the file.

    Best,

    Genevieve
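
One way to see which files share a sample name before running GenomicsDBImport is to read the sample columns from each file's #CHROM header line. The following is a minimal sketch, not part of GATK: the function names `list_samples` and `find_duplicate_samples` are hypothetical, and it assumes the files can be read with `gzip -dc` (bgzip output can) and that sample names start at column 10 of the tab-separated header.

```shell
#!/usr/bin/env bash
# Report sample names that appear in more than one compressed VCF.

# Print "sample<TAB>file" for every sample column in each file's #CHROM line.
list_samples() {
    local f
    for f in "$@"; do
        gzip -dc "$f" \
            | awk -v file="$f" -F'\t' '
                /^#CHROM/ { for (i = 10; i <= NF; i++) print $i "\t" file; exit }'
    done
}

# Print each sample name seen in two or more files, followed by those files.
find_duplicate_samples() {
    list_samples "$@" | sort | awk -F'\t' '
        { files[$1] = files[$1] " " $2; count[$1]++ }
        END { for (s in count) if (count[s] > 1) print s ":" files[s] }'
}
```

Calling `find_duplicate_samples *.vcf.gz` in the directory holding the inputs would print one line per duplicated sample name; no output means every sample name is unique across the files.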

  • jesus ix ballote

    Hi, Genevieve-Brandt-she-her

    You were right, I had to change the patient name in all the files. The error occurs because different files come from the same patient and therefore carry the same patient name.

    Once the patient name is changed so that no two files share the same name, the error is no longer raised.

    I took the following steps:

    1. Decompress the file with the bgzip command:

    bgzip -d file.vcf.gz

    2. Open it.

    3. Change these two names (Patient_1 in this example) inside each file:

    ##tumor_sample=Patient_1
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT   Patient_1

    4. Save the changes.

    5. Compress the file:

    bgzip file.vcf

    6. Finally, it is important to re-index the file. I did it with this command:

    tabix -p vcf file.vcf.gz

    My mistake was not re-indexing the file. Once I did, GATK worked!
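
The manual edit in steps 2 to 4 could also be scripted. The sketch below is an illustration under assumptions, not the poster's method: `rename_vcf_sample` is a hypothetical helper that reads VCF text on stdin and rewrites only the `##tumor_sample=` header line and the matching sample column of the tab-separated `#CHROM` line. The names `Patient_1_run2` and `renamed.vcf.gz` are placeholders.

```shell
#!/usr/bin/env bash
# Rename one sample in VCF text read from stdin; write the result to stdout.
# Usage: rename_vcf_sample OLD_NAME NEW_NAME
rename_vcf_sample() {
    awk -v old="$1" -v new="$2" 'BEGIN { FS = OFS = "\t" }
        # Header line that names the tumor sample.
        $0 == "##tumor_sample=" old { print "##tumor_sample=" new; next }
        # Column header line: sample names start at column 10.
        /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == old) $i = new }
        { print }'
}

# Hypothetical end-to-end use on a bgzip-compressed file (needs bgzip and tabix):
#   bgzip -dc file.vcf.gz | rename_vcf_sample Patient_1 Patient_1_run2 | bgzip > renamed.vcf.gz
#   tabix -p vcf renamed.vcf.gz
```

Writing to a new file and indexing it afterwards keeps the original untouched and ensures the index always matches the renamed file, which was the step that was missed at first.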

    Thanks for your advice!

  • Genevieve Brandt (she/her)

    Thank you for posting your solution, jesus ix ballote!
