Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

How to create a VCF file with variants of pooled lines?

5 comments

  • Gökalp Çelik

    Hi E Ra

    You seem to be looking for a very specific way of using your VCF files, which our tools neither support nor provide by default. If you want joint genotyping of your samples, we recommend the GenotypeGVCFs tool, which genotypes GVCF files produced by HaplotypeCaller and combined with CombineGVCFs and/or GenomicsDBImport.

    I hope this helps. 

  • E Ra

    Hi Gökalp

    After some research, I think that the fastq files of the lines of the same group need to be concatenated before the variant calling step.

    Thank you!

  • Gökalp Çelik

    Ah, I see. You want to pool samples within the same BAM file but call them into a VCF file that has a separate sample-name entry per pool.

    This is quite easy. You can assign different read groups to the different pools in the BAM file during the mapping stage. Once that is done, you will have a BAM file containing multiple samples (pools), so GATK tools will treat them as separate samples and will generate VCFs that include each pool at every variant site.

    I hope this helps. 

  • E Ra

    Hi Gökalp,

    I would like to try that. What would the code look like?

    Thank you

  • Gökalp Çelik

    Hi

    We cannot directly provide running code for your request; however, here is how the workflow would look:

    1- Map the reads of each pool to the reference genome. Assign read groups with unique IDs and sample names (SM). You can use the RevertSam and MergeBamAlignment tools to do this, or assign the read groups directly during the mapping stage; most mappers allow this.

    2- Merge all aligned pools into a single BAM file, using samtools merge or the gatk PrintReads tool.

    3- Run HaplotypeCaller to call variants with the proper ploidy parameter (--sample-ploidy).
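
    As a rough illustration of the three steps above, the Python sketch below only assembles the command lines and prints them; it does not run anything. It assumes bwa mem as the mapper and samtools merge for the merging step. The file names, the PL value ILLUMINA, the PU values, and the ploidy of 8 are hypothetical placeholders rather than values from this thread, and the right ploidy depends on how many chromosome copies each pool contains.

```python
import shlex


def read_group(rg_id, sample, library, platform="ILLUMINA", unit="unit1"):
    """Build an @RG string with backslash-t separators, as accepted by `bwa mem -R`."""
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}",
              f"PL:{platform}", f"PU:{unit}"]
    return "\\t".join(fields)


def build_commands(pools, reference, ploidy,
                   merged_bam="merged.bam", out_vcf="pools.vcf.gz"):
    """Return the shell commands for steps 1-3 as strings; nothing is executed."""
    commands, bams = [], []
    for i, (sample, fastqs) in enumerate(pools.items(), start=1):
        bam = f"{sample}.bam"
        rg = read_group(f"RG{i}", sample, f"LibraryName{i}", unit=f"unit{i}")
        reads = " ".join(shlex.quote(f) for f in fastqs)
        # Step 1: map one pool and tag its reads with that pool's read group.
        commands.append(
            f"bwa mem -R {shlex.quote(rg)} {shlex.quote(reference)} {reads}"
            f" | samtools sort -o {shlex.quote(bam)} -"
        )
        bams.append(shlex.quote(bam))
    # Step 2: merge the per-pool BAMs into one file and index it.
    commands.append(f"samtools merge {shlex.quote(merged_bam)} {' '.join(bams)}")
    commands.append(f"samtools index {shlex.quote(merged_bam)}")
    # Step 3: call variants on the merged BAM with the chosen ploidy.
    commands.append(
        f"gatk HaplotypeCaller -R {shlex.quote(reference)}"
        f" -I {shlex.quote(merged_bam)} -O {shlex.quote(out_vcf)}"
        f" --sample-ploidy {int(ploidy)}"
    )
    return commands


# Hypothetical input: two pools, each with paired-end FASTQ files.
pools = {
    "Pool1": ["pool1_R1.fastq.gz", "pool1_R2.fastq.gz"],
    "Pool2": ["pool2_R1.fastq.gz", "pool2_R2.fastq.gz"],
}
for cmd in build_commands(pools, reference="reference.fasta", ploidy=8):
    print(cmd)
```

    The reference is assumed to be indexed already for both bwa and GATK (bwa index, a .fai file, and a sequence dictionary). The printed commands can be reviewed and pasted into a shell once the placeholders are replaced.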

    Your read groups in the final BAM file should look similar to the ones below.

    @RG    ID:RG1    SM:Pool1    LB:LibraryName1    PL:PLATFORMID    PU:CENTERID
    @RG    ID:RG2    SM:Pool2    LB:LibraryName2    PL:PLATFORMID    PU:CENTERID

    Once this BAM file is fed to HaplotypeCaller, it will produce a VCF file with multiple samples, one for each SM value in the read groups.
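
    Before running the caller, it can help to check which sample names the merged BAM actually carries. The sketch below is one possible way to do that, assuming the header text has been obtained separately (for example with samtools view -H); the function name samples_from_header is made up for this illustration. It groups read-group IDs by their SM value, since each distinct SM value is what ends up as a sample in the VCF.

```python
def samples_from_header(header_text):
    """Map each SM value found in @RG lines to the read-group IDs that carry it."""
    samples = {}
    for line in header_text.splitlines():
        line = line.strip()
        # Real SAM headers are tab-separated; fall back to any whitespace
        # so that text copied from a web page is still parsed.
        fields = line.split("\t") if "\t" in line else line.split()
        if not fields or fields[0] != "@RG":
            continue
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        samples.setdefault(tags.get("SM"), []).append(tags.get("ID"))
    return samples


# The two read groups from the example above, written with tab separators.
HEADER = (
    "@RG\tID:RG1\tSM:Pool1\tLB:LibraryName1\tPL:PLATFORMID\tPU:CENTERID\n"
    "@RG\tID:RG2\tSM:Pool2\tLB:LibraryName2\tPL:PLATFORMID\tPU:CENTERID\n"
)
print(samples_from_header(HEADER))
```

    If two read groups share the same SM value, they are listed under one sample, which is a quick way to spot pools that were accidentally given the same name.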

    I hope this helps. 
