Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Empty vcf after GenotypeVCFs when combining already genotyped samples

0

8 comments

  • Avatar
    Tiffany Miller

    Hi @leze,

    Do both of these input vcfs pass ValidateVariants ? Can you explain what you mean by the first cohort (you mentioned you don't have access to the bams)? Are you running Haplotypecaller per sample?

    0
    Comment actions Permalink
  • Avatar
    leze

    Hi Tiffany Miller

    thanks for getting back to me.

    Yes, I did run ValidateVariants and the vcfs passed.

    Explanations to the datasets:

    1. Cohort: VCF with SNPs called with HC per sample and CombineGVCFs, GATK 3.5 (older samples, no bams available.

    2. Cohort: VCF/gVCF/BAM/fastq available (new samples). I used GATK 4.1.2.0.

    These two sets I tried combining, ideally using joint genotype calling from genotype likelihoods (annotations are available in both VCFs.

    Meanwhile, I was able to combine the two using the following approach:

    1. grep -vE "NON_REF" mixed.vcf > mixed.mod.vcf # removes all loci with NON_REF as ALT
      grep -vE "NON_REF" tetra.vcf > tetra.mod.vcf
    2. GATK 3.7 CombineVariants
    3. SelectVariants -select 'set == "Intersection"'

    This is ok for now but certainly not ideal. Maybe you have a suggestion to improve this

     

     

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Hi leze I will check with the team and get back to you. 

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Hi leze unfortunately, we don't have a better approach given your circumstance with not being able to regenerate the individual gvcfs for cohort 1. 

    Generally, we don't recommend joint calling across different versions of HC because there could be artifacts. The mapping quality annotations between these versions are incompatible. I wish I had a better suggestion.

    If anyone else in the community has experienced this and has something to add, please chime in!

    0
    Comment actions Permalink
  • Avatar
    Yangyxt

    I'm using GATK 4.1.8.1. And I found a similar error when trying to use GenotypeGVCFs to consolidate genotyping variants on chrX. It does not give any error message. It silently fails. BTW, I'm dealing with WES data.

    This is the code I used:

    For GenomicDBImport, I randomly select 50 samples from our history samples(using the same probe set) along with the current batch.

    time ${gatk} --java-options "-Xmx8g -Xms2g" GenomicsDBImport
    --tmp-dir /paedyl01/disk1/yangyxt/test_tmp
    --genomicsdb-update-workspace-path ${vcf_dir}/genomicdbimport_chr${1}
    -R ${ref_gen}/ucsc.hg19.fasta
    --batch-size 0
    --sample-name-map ${gvcf}/batch_cohort.sample_map
    --reader-threads 5
    check_return_code

    For GenotypeGVCFs

    time ${gatk} --java-options "-Xmx8g -Xms2g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenotypeGVCFs
    -R ${ref_gen}/ucsc.hg19.fasta
    -V gendb://${vcf_dir}/genomicdbimport_chr${1}
    -G StandardAnnotation
    -G AS_StandardAnnotation
    -L chr${1}
    -O ${bgvcf}/all_${seq_type}samples_plus${sample_batch}.chr${1}.HC.vcf

    These are log records:

    02:07:51.286 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
    02:07:51.321 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yangyxt/software/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl
    Nov 06, 2020 2:07:56 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    02:07:56.529 INFO GenotypeGVCFs - ------------------------------------------------------------
    02:07:56.529 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.8.1
    02:07:56.530 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    02:08:01.543 INFO GenotypeGVCFs - Executing as yangyxt@paedyl01 on Linux v3.10.0-1062.18.1.el7.x86_64 amd64
    02:08:01.543 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_161-b12
    02:08:01.543 INFO GenotypeGVCFs - Start Date/Time: November 6, 2020 2:07:51 AM HKT
    02:08:01.543 INFO GenotypeGVCFs - ------------------------------------------------------------
    02:08:01.544 INFO GenotypeGVCFs - ------------------------------------------------------------
    02:08:01.544 INFO GenotypeGVCFs - HTSJDK Version: 2.23.0
    02:08:01.545 INFO GenotypeGVCFs - Picard Version: 2.22.8
    02:08:01.545 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    02:08:01.545 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    02:08:01.545 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    02:08:01.545 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    02:08:01.545 INFO GenotypeGVCFs - Deflater: IntelDeflater
    02:08:01.545 INFO GenotypeGVCFs - Inflater: IntelInflater
    02:08:01.545 INFO GenotypeGVCFs - GCS max retries/reopens: 20
    02:08:01.545 INFO GenotypeGVCFs - Requester pays: disabled
    02:08:01.546 INFO GenotypeGVCFs - Initializing engine
    02:08:02.257 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.0-e701905
    02:08:02.331 info NativeGenomicsDB - pid=161923 tid=161924 No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF
    02:08:02.331 info NativeGenomicsDB - pid=161923 tid=161924 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the
    02:08:02.331 info NativeGenomicsDB - pid=161923 tid=161924 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated
    02:08:02.331 info NativeGenomicsDB - pid=161923 tid=161924 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated
    02:08:02.380 INFO IntervalArgumentCollection - Processing 155270560 bp from intervals
    02:08:02.385 INFO GenotypeGVCFs - Done initializing engine
    02:08:02.426 INFO ProgressMeter - Starting traversal
    02:08:02.426 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0,Cpu time(s),0.0
    02:08:02.685 INFO ProgressMeter - unmapped 0.0 0 0.0
    02:08:02.685 INFO ProgressMeter - Traversal complete. Processed 0 total variants in 0.0 minutes.
    02:08:02.688 INFO GenotypeGVCFs - Shutting down engine
    [November 6, 2020 2:08:02 AM HKT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.19 minutes.
    Runtime.totalMemory()=2190999552
    Using GATK jar /home/yangyxt/software/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -Xms2g -D

    real 0m18.775s
    user 0m8.039s
    sys 0m1.440s

    ##According to the log, it seems the traversal never started. But I checked the gvcf file on chrX of the current batch and I'm pretty sure there are lot's of variant records (for a single sample, around 25000 variant records on chrX) looking like these:
    chrX 2699968 . A G,<NON_REF> 894.06 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|75600.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|21,
    chrX 2700027 . T C,<NON_REF> 1681.06 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|165600.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|34
    chrX 2704469 . T C,<NON_REF> 70.64 . AS_RAW_BaseQRankSum=|-0.8,1|NaN;AS_RAW_MQ=10800.00|14400.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|
    chrX 2704556 . A G,<NON_REF> 105.64 . AS_RAW_BaseQRankSum=|-0.6,1|NaN;AS_RAW_MQ=7200.00|14400.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|1
    chrX 2706896 . T TA,<NON_REF> 44.27 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|0,0|
    chrX 2707060 . A G,<NON_REF> 37.31 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|0,0|
    chrX 2710840 . C T,<NON_REF> 73.83 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|10800.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|2,1
    chrX 2710995 . C T,<NON_REF> 192.97 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|21600.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|4,2
    chrX 2711528 . T A,<NON_REF> 29.64 . AS_RAW_BaseQRankSum=|0.0,1|NaN;AS_RAW_MQ=7200.00|7200.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|1.3
    chrX 2711722 . C T,<NON_REF> 37.31 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|0,0|
    chrX 2712283 . C T,<NON_REF> 66.31 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|0,0|

    Every other chromosome works well except for chrX. The output file ${bgvcf}/all_${seq_type}samples_plus${sample_batch}.chr${1}.HC.vcf is just a vcf header file with no real records.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Yangyxt please share the complete stack trace from GenomicsDBImport.

    0
    Comment actions Permalink
  • Avatar
    Yangyxt

    Dear Genevieve Brandt (she/her)

    Sorry I cannot replicate the error anymore. I re-ran the same command and it seems fine now. 

    The old log file is forcibly covered. 

    I add DGATK_STACKTRACE_ON_USER_EXCEPTION=true to both GenomicDBImport and GenotypeVCF now. If I ran into the same issue again. I'll report the complete stack trace here.

    Thank you!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Yangyxt great, thanks for the update! Hopefully it keeps working. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk