Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Read Group: ID for split files of multiple samples on multiple lanes

7 comments

  • Himawari

    Hi @Bhanu Gandham,

    Thank you for your reply. I actually went through the links you provided several times, trying to make sure I had things right before creating an account to post here. They got me through the normal fastq files, but they did not help with the split fastq files (i.e., the 001s - 004s). I have also shown you my attempt at it, but I am not sure whether it is correct.

    It IS CORRECT for my non-split fastq files (because it works).

    I also tried merging the files at the fastq stage instead of the BAM stage (I did it both ways). This leads to a new problem that I don't quite understand: I cannot get past MarkDuplicates.

    It always leads me to this:

    Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR::READ_GROUP_NOT_FOUND:Record 1, Read name HWI-ST688:152:81PMPABXX:1:1206:2462:160277, RG ID on SAMRecord not found in header: SCDO_1_X2_152.LOO1

    And the program just terminates there.

    I have checked my BAM files to make sure the @RG ID and the other fields are present, and they are there. To check whether I had actually done something wrong, I repeated the same steps on my set of files that were not split, and there was no problem with it.
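
For split files, the chunks (001 to 004) of one sample, library and lane normally belong to the same read group, so they should all receive the same ID. The following is a minimal sketch of deriving the `bwa mem -R` string from the file name instead of typing it by hand. It assumes a naming layout inferred from file names that appear later in this thread (for example `SCDO_1_X2_CGATGT_0152_L001_R1_001.fastq.gz`) and mirrors the field choices in the poster's headers; the pattern and the function name `read_group_for` are hypothetical and not part of any GATK or bwa tooling.

```python
import os
import re

# Assumed layout: <sample>_<barcode>_<run>_<lane>_<read>[_<chunk>].fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_(?P<run>\d+)"
    r"_(?P<lane>L\d{3})_(?P<read>R[12])(?:_(?P<chunk>\d{3}))?\.fastq\.gz$"
)


def read_group_for(fastq_path):
    """Build a bwa -R string; read number and chunk number are ignored on purpose."""
    m = FASTQ_NAME.match(os.path.basename(fastq_path))
    if m is None:
        raise ValueError(f"unexpected FASTQ name: {fastq_path}")
    sample = f"{m['sample']}_{int(m['run'])}"   # e.g. SCDO_1_X2_152
    rg_id = f"{sample}.{m['lane']}"             # one read group per sample and lane
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",
        f"LB:{sample}",          # assumption: one library per sample, as in the thread
        "PL:illumina",
        f"PU:{m['barcode']}",    # the thread uses the barcode as PU
    ]
    # bwa expects a literal backslash-t between fields, not a real tab
    return "\\t".join(fields)


if __name__ == "__main__":
    for name in (
        "SCDO_1_X2_CGATGT_0152_L001_R1_001.fastq.gz",
        "SCDO_1_X2_CGATGT_0152_L001_R2_004.fastq.gz",
    ):
        print(name, "->", read_group_for(name))
```

Because the lane token is copied from the file name, every chunk of a lane gets an identical ID, and the ID cannot drift between steps through retyping.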

  • Bhanu Gandham

    Can you please post the BAM header using the "samtools view -H" command?

  • Himawari

    Hi Bhanu Gandham

    Here it is (these 2 are unsuccessful; the difference between them is that the 1st set was merged at the FASTQ file stage and the 2nd set was merged at the BAM file stage):

    @SQ SN:chrUn_GL000214v1 LN:137718 M5:46c2032c37f2ed899eb41c0473319a69 UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_KI270742v1 LN:186739 M5:2f31c013a4a8301deb8ab7ed1ca1cd99 UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_GL000216v2 LN:176608 M5:725009a7e3f5b78752b68afa922c090c UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_GL000218v1 LN:161147 M5:1d708b54644c26c7e01c2dad5426d38c UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrEBV LN:171823 M5:6743bd63b3ff2b5b8985d8933c53290a UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @RG ID:SCDO_1_X2_152.L001 LB:SCDO_1_X2_152 PL:illumina SM:SCDO_1_X2_152 PU:CGATGT
    @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 12 -R @RG\tID:SCDO_1_X2_152.LOO1\tSM:SCDO_1_X2_152\tLB:SCDO_1_X2_152\tPL:illumina\tPU:CGATGT /home/himawari/REF/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna /home/himawari/SCD/INPUT/SCDO_1_X2/SCDO_1_X2_CGATGT_0152_L001_R1.fastq.gz /home/himawari/SCD/INPUT/SCDO_1_X2/SCDO_1_X2_CGATGT_0152_L001_R2.fastq.gz
    @SQ SN:chrUn_GL000214v1 LN:137718 M5:46c2032c37f2ed899eb41c0473319a69 UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_KI270742v1 LN:186739 M5:2f31c013a4a8301deb8ab7ed1ca1cd99 UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_GL000216v2 LN:176608 M5:725009a7e3f5b78752b68afa922c090c UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_GL000218v1 LN:161147 M5:1d708b54644c26c7e01c2dad5426d38c UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrEBV LN:171823 M5:6743bd63b3ff2b5b8985d8933c53290a UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @RG ID:SCDO_1_X2_152.L001 LB:SCDO_1_X2_152 PL:illumina SM:SCDO_1_X2_152 PU:CGATGT
    @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 12 -R @RG\tID:SCDO_1_X2_152.LOO1\tSM:SCDO_1_X2_152\tLB:SCDO_1_X2_152\tPL:illumina\tPU:CGATGT /home/himawari/REF/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna /home/himawari/SCD/INPUT/SCDO_1_X2/SCDO_1_X2_CGATGT_0152_L001_R1_001.fastq.gz /home/himawari/SCD/INPUT/SCDO_1_X2/SCDO_1_X2_CGATGT_0152_L001_R2_001.fastq.gz

    and here is the successful one (this file never went through any sort of splitting procedure):

    @SQ SN:chrUn_GL000214v1 LN:137718 M5:46c2032c37f2ed899eb41c0473319a69 UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_KI270742v1 LN:186739 M5:2f31c013a4a8301deb8ab7ed1ca1cd99 UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_GL000216v2 LN:176608 M5:725009a7e3f5b78752b68afa922c090c UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrUn_GL000218v1 LN:161147 M5:1d708b54644c26c7e01c2dad5426d38c UR:file:/home/himawari/TEST/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @SQ SN:chrEBV LN:171823 M5:6743bd63b3ff2b5b8985d8933c53290a UR:file:/home/himawari/REF/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    @RG ID:FLS928.LOO1 LB:FLS928 PL:illumina SM:FLS928 PU:unit1
    @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 12 -R @RG\tID:FLS928.LOO1\tSM:FLS928\tLB:FLS928\tPL:illumina\tPU:unit1 /home/himawari/REF/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna /home/himawari/NGS_TEST/INPUT/FLS928_R1.fastq.gz /home/himawari/NGS_TEST/INPUT/FLS928_R2.fastq.gz

    Thank you.

    EDIT:
    Here are the steps I used to get that file:
    1. Cat the fastq.gz files.
    2. Generate uBAM.
    3. Align using bwa mem.
    4. Sort (and convert) the SAM/BAM files by coordinate.
    5. Merge the sorted BAM files (for the set that was not cat in step 1) using samtools merge.
    6. Add read groups (to both the uBAM and BAM files).
    7. Merge the uBAM and BAM files using Picard's MergeBamAlignment.
    8. Mark duplicates in the merged BAM files from step 7. << this failed.
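
One detail visible in the headers posted above: in both failing files the @RG line declares `ID:SCDO_1_X2_152.L001` (digit zeros), while the `bwa mem -R` string recorded on the @PG line uses `ID:SCDO_1_X2_152.LOO1` (letter O), which is the ID named in the READ_GROUP_NOT_FOUND error. In the successful file both places use `FLS928.LOO1`. The thread does not confirm this as the cause, so treat it as a possibility. A minimal sketch of an automated comparison follows, assuming the header text was saved from `samtools view -H`; the function names are hypothetical.

```python
import re


def rg_ids_in_header(header_text):
    """Return (declared, requested): IDs on @RG lines and IDs passed to bwa via -R."""
    declared, requested = set(), set()
    for line in header_text.splitlines():
        line = line.strip()
        if line.startswith("@RG"):
            m = re.search(r"\sID:(\S+)", line)
            if m:
                declared.add(m.group(1))
        elif line.startswith("@PG"):
            # bwa records its command line; -R usually keeps a literal "\t" separator
            m = re.search(r"-R\s+@RG.*?(?:\\t|\s)ID:([^\\\s]+)", line)
            if m:
                requested.add(m.group(1))
    return declared, requested


def undeclared_rg_ids(header_text):
    """IDs that bwa was asked to tag reads with but that no @RG line declares."""
    declared, requested = rg_ids_in_header(header_text)
    return sorted(requested - declared)


if __name__ == "__main__":
    # Shortened from the failing header in this thread (paths omitted)
    failing = "\n".join([
        r"@RG ID:SCDO_1_X2_152.L001 LB:SCDO_1_X2_152 PL:illumina SM:SCDO_1_X2_152 PU:CGATGT",
        r"@PG ID:bwa PN:bwa CL:bwa mem -t 12 -R @RG\tID:SCDO_1_X2_152.LOO1\tSM:SCDO_1_X2_152 ref.fna",
    ])
    print(undeclared_rg_ids(failing))
```

An empty result means every ID given to bwa is also declared in an @RG line. A non-empty result points at reads whose RG tag may not resolve against the header.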

  • Bhanu Gandham

    Hi Himawari

    We recommend you use these pre-built format conversion workflows: https://app.terra.bio/#workspaces/help-gatk/Sequence-Format-Conversion/workflows/help-gatk/Paired-FASTQ-to-Unmapped-BAM

     

    Since we have not tested the workflow you created, it is difficult for us to debug it. Please try the workflows we have provided; if you still see an error with those, we will help fix it.

  • Himawari

    Hi Bhanu Gandham

    I am not sure what happened. I re-ran the script after giving up for a day and it succeeded. I did not change anything in the script except the reference genome version, from hg38 to b37.

    Is there any reason why this happened?

    Also, pardon me for all the questions, but I am now stuck at the FilterVariantTranches step. I am doing an exome analysis and I am not quite sure which parameters I should use. Should I open another thread for it? (I have searched through the community and don't see anyone posting about it or a solution yet; the tool was experimental previously.)

  • Bhanu Gandham

    Yes, please open a different thread. Thank you for checking with us!


Post is closed for comments.
