GenotypeGVCFs generates an empty VCF file with all my samples
REQUIRED for all errors and issues:
a) GATK version used: v4.2.2.0
b) Exact command used:
output_dir="/home/wovalle3/LREY/variant_call/prueba"
gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
$(for file in $output_dir/gvcfs/*.g.vcf.gz; do echo "-V $file"; done) \
--genomicsdb-workspace-path my_database \
--tmp-dir = /home/wovalle3/LREY/variant_call/prueba/gvcfs/ \
--intervals chr1 \
--intervals chr2 \
--intervals chr3 \
--intervals chr4 \
--intervals chr5 \
--intervals chr6 \
--intervals chr7 \
--intervals chr8 \
--intervals chr9 \
--intervals chr10 \
--intervals chr11 \
--intervals chr12 \
--intervals chr13 \
--intervals chr14 \
--intervals chr15 \
--intervals chr16 \
--intervals chr17 \
--intervals chr18 \
--intervals chr19 \
--intervals chr20 \
--intervals chr21 \
--intervals chr22 \
--intervals chrX
c) Entire program log:
It seems that when running the GenomicsDBImport command in GATK, I'm encountering a warning "WARN IntelInflater - Zero Bytes Written : 0", despite no error being generated. This warning suggests that no genomic data is being imported. I already verified that my vcf.gz
files obtained from HaplotypeCaller do contain information.
However, when I proceed to the GenotypeGVCFs step, it generates an empty VCF file with all my samples. This leads me to suspect that the issue may be occurring during the GenomicsDBImport command.
Does anyone know why I might be having this problem?
-
Can you try adding
--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'
so that you may be able to see if there are any underlying hidden error messages?
One recommendation. Instead of injecting bash script calls into the command line generation you can use the below parameter to import your GVCF files with least pain.
--sample-name-map samples.list
samples.list file should look like below
sample1 /path/to/sample1.g.vcf.gz
sample2 /path/to/sample2.g.vcf.gzfields must be separated by tabs.
Also can you try running your import with the latest version of 4.5.0.0 if possible?
I hope this helps.
-
Thank you for the response, however, your suggestions did not work.
I tried running GenotypeGVCFs with only one of the vcf files I got from the HaplotypeCaller, instead of the input gendb://my_database from GenomicsDBImport, just to check if the problem was because of mixed-up paths, but I'm still getting an empty vcf.
this was the command line I used:
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R /home/wovalle3/Samples/GRCh38_Gencode31/hg38.fa \
-V NT-260_Aligned.out.g.vcf.gz \
-O output_prueba.vcf.gz -
It is interesting that you are still getting an empty file when genotyping a single one. Have you received any error messages during the preparation of those GVCF files?
Can you check the contents of the GVCF file and share a few lines using the command below and share with us?
bcftools view -H NT-260_Aligned.out.g.vcf.gz | head -n 10
Regards.
-
As far as I checked, HaplotypeCaller did not present any errors. I ran the command line you suggested in only one of the samples, and this is what I got:
chr1 1 . N <NON_REF> . . END=248956422 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr10 1 . N <NON_REF> . . END=133797422 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr11 1 . N <NON_REF> . . END=135086622 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr11_KI270721v1_random 1 . C <NON_REF> . . END=100316 GT:DP:GQ:MIN_DP:PL
0/0:0:0:0:0,0,0
chr12 1 . N <NON_REF> . . END=133275309 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr13 1 . N <NON_REF> . . END=114364328 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr14 1 . N <NON_REF> . . END=107043718 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr14_GL000009v2_random 1 . G <NON_REF> . . END=201709 GT:DP:GQ:MIN_DP:PL
0/0:0:0:0:0,0,0
chr14_GL000225v1_random 1 . G <NON_REF> . . END=211173 GT:DP:GQ:MIN_DP:PL
0/0:0:0:0:0,0,0
chr14_KI270722v1_random 1 . G <NON_REF> . . END=194050 GT:DP:GQ:MIN_DP:PL -
This result is quite unexpected. It looks like your GVCFs only contain a single confidence interval that encompasses the whole chromosomes with HOM_REF therefore you don't get anything out of your genotyping. Are you sure your bam files are not empty?
How did you generate these GVCF files? Can you share your former steps with a little more detail?
Regards.
-
I did not check the bam files, but I will now.
To generate the GVCF files i followed these steps:
1. Alignment with STAR:
STAR_options="--runThreadN 20 \
--limitGenomeGenerateRAM 100000000000 \
--genomeDir /home/wovalle3/Samples/GRCh38_Gencode31/Genomeindex \
--outSAMunmapped Within \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
ismatchNmax 999 \
--outFilterMismatchNoverLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--alignSJoverhangMin 8 \
--limitSjdbInsertNsj 2500000 \
--alignSJDBoverhangMin 1 \
--sjdbScore 1 \
--outFilterMatchNminOverLread 0.66 \
--outSAMtype BAM Unsorted --quantMode TranscriptomeSAM GeneCounts \
--peOverlapNbasesMin 10 --alignEndsProtrude 10 ConcordantPair"
for sample in "${samples[@]}"; do
R1="${sample}_nonrrna_R1.fastq.gz"
R2="${sample}_nonrrna_R2.fastq.gz"
STAR $STAR_options --readFilesCommand zcat --readFilesIn "$input_dir/$R1" "$input_dir/$R2" --outFileNamePrefix "${sample}_" 1> "${sample}_output.log"
echo "STAR completado para la muestra $sample"
done2. AddOrReplaceReadGroups to mark reads
for bam_file in "${bam_dir}"/*.bam; do
base_name=$(basename "$bam_file" .bam)
java -jar "${picard}" AddOrReplaceReadGroups \
I="$bam_file" \
O="${output_dir}/${base_name}-wRG.bam" \
SO=coordinate \
CREATE_INDEX=true \
RGID=K00171 \
RGLB=SureSelectXT \
RGPL=illumina \
RGPU=K00171 \
RGSM="${base_name}"
done
echo "Archivo ${base_name}.bam ha sido procesado."
done
echo "La marca de grupos de lecturas y la creación de índices han sido completadas."3. Mark duplicates
for bam_file in "${bam_dir}"/*-wRG.bam; do
base_name=$(basename "$bam_file" -wRG.bam)
salida="${base_name}-noDups.bam"
### Marcar duplicados
"$java" -jar "$picard" MarkDuplicates \
I="$bam_file" \
O="$output_dir/$salida" \
M="$output_dir/${salida%.bam}.metrics" \
REMOVE_DUPLICATES=true
done4. AddOrReplaceReadGroups for non duplicates bam files
for bam_file in "${bam_dir}"/*-noDups.bam; do
base_name=$(basename "$bam_file" -noDups.bam)
salida="${base_name}-sorted2.bam"
"$java" -jar "$picard" AddOrReplaceReadGroups \
I="$bam_file" \
O="$output_dir/$salida" \
SO=coordinate \
CREATE_INDEX=true \
RGID=K00171 \
RGLB=SureSelectXT \
RGPL=illumina \
RGPU=K00171 \
RGSM="${base_name}"
done5. Base recalibration
for bam_file in "$output_dir"/*-sorted2.bam; do
base_name=$(basename "$bam_file" "-sorted2.bam")
output_table="${base_name}-recal_data.table"
# Ejecutar BaseRecalibrator
gatk BaseRecalibrator \
-R "$reference" \
-I "$bam_file" \
--known-sites "$dbsnp" \
-O "$output_table" \
--use-original-qualities
done6. ApplyBQSR for recalibraiton of BAM files
for bam_file in "$output_dir"/*-sorted2.bam; do
base_name=$(basename "$bam_file" "-sorted2.bam")
recal_table="${output_dir}/${base_name}-recal_data.table"
# Ejecutar ApplyBQSR
gatk ApplyBQSR \
-I "$bam_file" \
-R "$reference" \
--bqsr-recal-file "$recal_table" \
-O "${output_dir}/${base_name}-BQSR.bam"
done6. HaplotypeCaller to obtain the VCF files
for bam_file in "$output_dir"/*-BQSR.bam; do
sample_name=$(basename "$bam_file" -BQSR.bam)
gatk --java-options "-Xmx35g" HaplotypeCaller \
-R "$reference" \
-I "$bam_file" \
-O "$output_dir/gvcfs/${sample_name}.g.vcf.gz" \
-ERC GVCF \
--dbsnp "$dbsnp_file" \
--native-pair-hmm-threads "$num_threads"
doneAfter obtaining these first VCF files, I did the GenomicsDBImport and then the GenotypeGVCFs steps as I described later, and started to get errors, I guess because of the empty files.
Sorry for the long response, hopefully, you can guide me with what I am doing wrong.
Thanks for the help
-
i checked the bam files and they do have information. I obtained read counts from them and also ran various quality metrics and they seem good. Maybe it is something else
-
Bingo! These are STAR aligned transcriptome bam files which marks unique mapping reads as with MAPQ 255. 255 score is actually no different than MAPQ 0 for GATK tools also HaplotypeCaller does not like reads with splice tags ("N") within CIGARs therefore you need to readjust your mapping quality scores using SplitNCigarReads tool before you can start calling variants using HaplotypeCaller.
We have a small article on how to call short variants from RNASeq data in the link below
I hope this helps.
-
Thank you so much for the help. I will implement this new step as you suggested.
Please sign in to leave a comment.
9 comments