Why Haplotypecaller gives me empty vcf (only the header) for all of my samples?
REQUIRED for all errors and issues:
a) GATK version used:
gatk-package-4.5.0.0
b) Exact command used:
gatk HaplotypeCaller -R ${gencode_path}/Homo_sapiens_assembly38.fasta \
-I ${aligned_reads}/sample1_sort_dup_bqsr.bam \
-ERC GVCF \
-O ${data_Output}/sample1.g.vcf.gz
c) Entire program log:
-
Can you provide us more details about your workflow? What is your aligner for these bam files?
Do you get any error messages within the HaplotypeCaller run? If so can you post that here?
Regards.
-
Thanks for your reply!
I am using bwa/0.7.16 for aligning:
bwa mem -t 4 -R "@RG\tID:${1}\tPL:ILLUMINA\tSM:sample1" \
${gencode_path2}/hg38.fa \
${raw_data}/sample1_R1.fastq.gz \
${raw_data}/sample1_R2.fastq.gz > \
${aligned_reads}/sample1.paired.sam
Then use gatk4/4.4.0.0 to mark duplicates and sort:
gatk MarkDuplicatesSpark -I ${aligned_reads}/sample1.paired.sam \
-O ${aligned_reads}/sample1_sorted_dedup_reads.bam
Next, do base quality recalibration:
gatk BaseRecalibrator \
-I ${aligned_reads}/sample1_sorted_dedup_reads.bam -R \
${gencode_path}/hg38.fa \
--known-sites ${gencode_path}/Homo_sapiens_assembly38.dbsnp138.vcf \
-O ${data_output}/recal_data.table
Then,
gatk ApplyBQSR \
-I ${aligned_reads}/sample1_sorted_dedup_reads.bam \
-R ${gencode_path}/hg38.fa \
--bqsr-recal-file {$data_output}/recal_data.table \
-O ${aligned_reads}/sample1_sorted_dedup_bqsr_reads.bam
Next collect the alignments and here is when it gives zeros:
gatk CollectAlignmentSummaryMetrics \
R=${gencode_path}/hg38.fa \
I=${aligned_reads}/sample1_sorted_dedup_bqsr_reads.bam \
O=${aligned_reads}/sample1_alignment_metrics.txt
-
Can you check whether your logs contain any kind of error messages and early terminations? If so can you post them here?
-
Sorry, I edited my previous post. No error happens and the zeros appear when I collect the alignments:
## METRICS CLASS picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH
SD_READ_LENGTH MEDIAN_READ_LENGTH MAD_READ_LENGTH MIN_READ_LENGTH MAX_READ_LENGTH MEAN_ALIGNED_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS PF_READS_IMPROPER_PAIRS PCT_PF_READS_IMPROPER_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER PCT_SOFTCLIP PCT_HARDCLIP AVG_POS_3PRIME_SOFTCLIP_LENGTH SAMPLE LIBRARY READ_GROUP
UNPAIRED 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
-
What kind of experiment is this one? Can you check if your final sorted bam file contains any reads inside?
-
It is WES with Illumina platform. Yes, sample1_sorted_dedup_bqsr_reads.bam contains reads. Is there any details I need to ask people who provided the fastq files?
-
Here is the top few lines of sample1_sorted_dedup_bqsr_reads.bam:
LH00557:25:22KNK7LT3:5:1143:42931:4413 65 chr1 10000 0 3S97M chr2 32672328 0 ATAATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC III-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII99II-III99IIIIIIII9IIIIIIIIIII-II9I MC:Z:4S96M MD:Z:97 RG:Z:15650 NM:i:0 AS:i:97 XS:i:96
LH00557:25:22KNK7LT3:5:1254:10616:1897 97 chr1 10000 0 4S96M = 211929503 211919599 CCTCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC IIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-IIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MC:Z:96M4S MD:Z:96 RG:Z:15650 NM:i:0 AS:i:96 XS:i:95
LH00557:25:22KNK7LT3:5:1163:46158:1929 177 chr1 10000 0 42S53M5S chr10 128147464 0 TGGAAAACCCTAGGGAATATAAAAATGTAGCGTTTATTCCATATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAGAAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-III SA:Z:chr1,85987646,-,43M57S,0,0; MC:Z:95M5S MD:Z:53 RG:Z:15650 NM:i:0 AS:i:53 XS:i:53
LH00557:25:22KNK7LT3:5:1129:30245:17551 1105 chr1 10013 0 97M3S chr11 79987814 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCACAG IIIIIIIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MC:Z:100M MD:Z:97 RG:Z:15650 NM:i:0 AS:i:97 XS:i:96
LH00557:25:22KNK7LT3:5:1129:30264:17551 1105 chr1 10013 0 97M3S chr11 79987818 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCACAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-II- MC:Z:4S96M MD:Z:97 RG:Z:15650 NM:i:0 AS:i:97 XS:i:96
-
What bothers me here is the all zeroes indicating either there are no reads present or all reads have MAPQ 0 which results in filtering all of those reads by the tool.
Can you try running the below command and see how many reads are present?
samtools view -q 60 sample1_sorted_dedup_bqsr_reads.bam | wc -l
-
Thank you so much for your guide!
It gives me 0!!!
Does it mean that I am miss-aligning? I contacted people providing the data and they said either GRCh37 or GRCh38 references can be used for aligning. Is it different from what I have used (hg38)? I have also tried it with Homo_sapiens_assembly38.fasta but still the same.
-
I just checked
samtools view -q 60 sample1_sorted_dedup_reads.bam | wc -l
And it gives me: 46453705. Does it mean that ApplyBQSR is not performing correctly or does it mean that something is wrong with my data? If the later is true, may I ask what it is so I can tell the people who provided the data?
Thank you so much for all your helps!
-
Can you post the log for BaseRecalibrator and ApplyBQSR steps here so that we can check?
One thing that I noticed is that your base call qualities are too low. 6 and 16 are the values that I observed from the first couple of reads you posted here. Can you also check your bam file with fastqc or any other read checking tool to see if basecall qualities are any good?
I am also concerned that if this data was further compressed in lossy format then it is possible that basecall values are lost in translation but this is just a guess.
-
Hi,
Do you mean this log (I had provided it in previous comments)?
gatk BaseRecalibrator \
-I ${aligned_reads}/sample1_sorted_dedup_reads.bam -R \
${gencode_path}/hg38.fa \
--known-sites ${gencode_path}/Homo_sapiens_assembly38.dbsnp138.vcf \
-O ${data_output}/recal_data.table
Then,
gatk ApplyBQSR \
-I ${aligned_reads}/sample1_sorted_dedup_reads.bam \
-R ${gencode_path}/hg38.fa \
--bqsr-recal-file {$data_output}/recal_data.table \
-O ${aligned_reads}/sample1_sorted_dedup_bqsr_reads.bam
The fastqc on fastq files shows good quality. Do you mean checking sample1_sorted_dedup_bqsr_reads.bam file on fastqc?
-
To me it looks like issue is happening at ApplyBQSR step. Am I right?
Here is the fastqc of raw fastq files (R1 & R2),
Filename sample1_R2.fastq.gz File type Conventional base calls Encoding Sanger / Illumina 1.9 Total Sequences 27327351 Sequences flagged as poor quality 0 Sequence length 100 %GC 46 Filename sample1_R1.fastq.gz File type Conventional base calls Encoding Sanger / Illumina 1.9 Total Sequences 27327351 Sequences flagged as poor quality 0 Sequence length 100 %GC 46
sample1_sorted_dedup_reads.bam,
Filename sample1_sorted_dedup_reads.bam File type Conventional base calls Encoding Sanger / Illumina 1.9 Total Sequences 54975032 Sequences flagged as poor quality 0 Sequence length 30-100 %GC 46 and sample1_sorted_dedup_bqsr_reads.bam.
Filename sample1_sorted_dedup_bqsr_reads.bam File type null Encoding Illumina 1.5 Total Sequences 0 Sequences flagged as poor quality 0 Sequence length 0 %GC 0
-
Log is created by our tool during the runtime and pushed to stderr normally unless it is redirected to a log file by the command line. Can you send us that output from ApplyBQSR and BaseRecalibrator?
-
Thanks for explaining. Here is the top lines of BaseRecalibrator:
23:23:43.288 INFO BaseRecalibrator - ------------------------------------------------------------
23:23:43.290 INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.4.0.0
23:23:43.291 INFO BaseRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
23:23:43.291 INFO BaseRecalibrator - Executing as fdarbaniyan@edragon017 on Linux v4.18.0-425.3.1.el8.x86_64 amd64
23:23:43.291 INFO BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
23:23:43.291 INFO BaseRecalibrator - Start Date/Time: June 19, 2024 at 11:23:43 PM CDT
23:23:43.291 INFO BaseRecalibrator - ------------------------------------------------------------
23:23:43.291 INFO BaseRecalibrator - ------------------------------------------------------------
23:23:43.292 INFO BaseRecalibrator - HTSJDK Version: 3.0.5
23:23:43.292 INFO BaseRecalibrator - Picard Version: 3.0.0
23:23:43.292 INFO BaseRecalibrator - Built for Spark Version: 3.3.1
23:23:43.292 INFO BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:23:43.293 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:23:43.294 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:23:43.294 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:23:43.294 INFO BaseRecalibrator - Deflater: IntelDeflater
23:23:43.294 INFO BaseRecalibrator - Inflater: IntelInflater
23:23:43.294 INFO BaseRecalibrator - GCS max retries/reopens: 20
23:23:43.294 INFO BaseRecalibrator - Requester pays: disabled
23:23:43.294 INFO BaseRecalibrator - Initializing engine
23:23:43.515 INFO FeatureManager - Using codec VCFCodec to read file file:///rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/gencode/hg38/Homo_sapiens_assembly38.dbsnp138.vcf
23:23:43.665 INFO BaseRecalibrator - Done initializing engine
23:23:43.669 INFO BaseRecalibrationEngine - The covariates being used here:
23:23:43.669 INFO BaseRecalibrationEngine - ReadGroupCovariate
23:23:43.669 INFO BaseRecalibrationEngine - QualityScoreCovariate
23:23:43.669 INFO BaseRecalibrationEngine - ContextCovariate
23:23:43.669 INFO BaseRecalibrationEngine - CycleCovariate
23:23:43.676 INFO ProgressMeter - Starting traversal
23:23:43.676 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
23:23:53.693 INFO ProgressMeter - chr1:24183589 0.2 471000 2822049.1
23:24:03.697 INFO ProgressMeter - chr1:52237854 0.3 998000 2990859.6
23:24:13.717 INFO ProgressMeter - chr1:93984700 0.5 1429000 2854099.4
23:24:23.721 INFO ProgressMeter - chr1:152215153 0.7 1938000 2903733.3
23:24:33.735 INFO ProgressMeter - chr1:180175429 0.8 2467000 2956969.9
23:24:43.769 INFO ProgressMeter - chr1:217158349 1.0 2922000 2917526.5
23:24:53.795 INFO ProgressMeter - chr10:4318621 1.2 3364000 2878535.1
23:25:03.813 INFO ProgressMeter - chr10:44772698 1.3 3760000 2815179.0
23:25:13.824 INFO ProgressMeter - chr10:86460471 1.5 4184000 2784784.9
23:25:23.823 INFO ProgressMeter - chr10:119915812 1.7 4668000 2796688.9
23:25:33.825 INFO ProgressMeter - chr11:17527238 1.8 5148000 2804227.0
23:25:43.834 INFO ProgressMeter - chr11:59647750 2.0 5581000 2786830.7
23:25:53.851 INFO ProgressMeter - chr11:85976562 2.2 6115000 2818513.5
-
And here is for ApplyBQSR:
23:36:27.258 INFO ApplyBQSR - ------------------------------------------------------------
23:36:27.260 INFO ApplyBQSR - The Genome Analysis Toolkit (GATK) v4.4.0.0
23:36:27.260 INFO ApplyBQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
23:36:27.260 INFO ApplyBQSR - Executing as fdarbaniyan@edragon017 on Linux v4.18.0-425.3.1.el8.x86_64 amd64
23:36:27.260 INFO ApplyBQSR - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
23:36:27.260 INFO ApplyBQSR - Start Date/Time: June 19, 2024 at 11:36:27 PM CDT
23:36:27.260 INFO ApplyBQSR - ------------------------------------------------------------
23:36:27.261 INFO ApplyBQSR - ------------------------------------------------------------
23:36:27.261 INFO ApplyBQSR - HTSJDK Version: 3.0.5
23:36:27.261 INFO ApplyBQSR - Picard Version: 3.0.0
23:36:27.261 INFO ApplyBQSR - Built for Spark Version: 3.3.1
23:36:27.261 INFO ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:36:27.262 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:36:27.262 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:36:27.262 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:36:27.262 INFO ApplyBQSR - Deflater: IntelDeflater
23:36:27.262 INFO ApplyBQSR - Inflater: IntelInflater
23:36:27.262 INFO ApplyBQSR - GCS max retries/reopens: 20
23:36:27.262 INFO ApplyBQSR - Requester pays: disabled
23:36:27.262 INFO ApplyBQSR - Initializing engine
23:36:27.411 INFO ApplyBQSR - Done initializing engine
23:36:27.428 INFO ProgressMeter - Starting traversal
23:36:27.429 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
23:36:27.502 INFO ApplyBQSR - Shutting down engine
[June 19, 2024 at 11:36:27 PM CDT] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=285212672
***********************************************************************
A USER ERROR has occurred: Couldn't read file. Error was: {/rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/Sattva_data/WES/P2007014_05162024/Bioinfirmagician_output/results}/15650_recal_data.table with exception: {/rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/Sattva_data/WES/P2007014_05162024/Bioinfirmagician_output/results}/15650_recal_data.table (No such file or directory)
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /risapps/rhel8/miniconda3/py39_4.12.0/envs/gatk4-4.4.0.0/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /risapps/rhel8/miniconda3/py39_4.12.0/envs/gatk4-4.4.0.0/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar BaseRecalibrator -I /rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/Sattva_data/WES/P2007014_05162024/Bioinfirmagician_output/aligned_reads/61850_LYG-3_sorted_dedup_reads.bam -R /rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/gencode/hg38/hg38.fa --known-sites /rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/gencode/hg38/Homo_sapiens_assembly38.dbsnp138.vcf -O /rsrch6/scratch/hema_bio-Malignan/fdarbaniyan/Sattva_data/WES/P2007014_05162024/Bioinfirmagician_output/results/61850_LYG-3_recal_data.table
-
The error message indicates that recalibration table file is not present or path is incorrect for ApplyBQSR. Can you check if that file is properly created by BaseRecalibrator tool?
-
Thank you so much!
You are right and I had a typo in my code. Now it works perfectly and I got my results.
Appreciate!
Faezeh
Please sign in to leave a comment.
18 comments