Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs generates an empty VCF file with all my samples

0

9 comments

  • Avatar
    Gökalp Çelik

    Hi Laura Rey Vargas

    Can you try adding 

    --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'

    so that you may be able to see if there are any underlying hidden error messages?

    One recommendation. Instead of injecting bash script calls into the command line generation you can use the below parameter to import your GVCF files with least pain.

    --sample-name-map samples.list

    samples.list file should look like below

    sample1 /path/to/sample1.g.vcf.gz
    sample2 /path/to/sample2.g.vcf.gz

    fields must be separated by tabs. 

    Also can you try running your import with the latest version of 4.5.0.0 if possible?

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Laura Rey Vargas

    Thank you for the response, however, your suggestions did not work.

    I tried running GenotypeGVCFs with only one of the vcf files I got from the HaplotypeCaller, instead of the input gendb://my_database from GenomicsDBImport, just to check if the problem was because of mixed-up paths, but I'm still getting an empty vcf. 

    this was the command line I used:

     gatk --java-options "-Xmx4g" GenotypeGVCFs \
       -R /home/wovalle3/Samples/GRCh38_Gencode31/hg38.fa \
       -V NT-260_Aligned.out.g.vcf.gz \
       -O output_prueba.vcf.gz
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Laura Rey Vargas

    It is interesting that you are still getting an empty file when genotyping a single one. Have you received any error messages during the preparation of those GVCF files? 

    Can you check the contents of the GVCF file and share a few lines using the command below and share with us?

    bcftools view -H NT-260_Aligned.out.g.vcf.gz | head -n 10

    Regards. 

     

    0
    Comment actions Permalink
  • Avatar
    Laura Rey Vargas

    As far as I checked, HaplotypeCaller did not present any errors. I ran the command line you suggested in only one of the samples, and this is what I got:

    chr1    1       .       N       <NON_REF>       .       .       END=248956422   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    chr10   1       .       N       <NON_REF>       .       .       END=133797422   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    chr11   1       .       N       <NON_REF>       .       .       END=135086622   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    chr11_KI270721v1_random 1       .       C       <NON_REF>       .       .       END=100316      GT:DP:GQ:MIN_DP:PL
      0/0:0:0:0:0,0,0
    chr12   1       .       N       <NON_REF>       .       .       END=133275309   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    chr13   1       .       N       <NON_REF>       .       .       END=114364328   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    chr14   1       .       N       <NON_REF>       .       .       END=107043718   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    chr14_GL000009v2_random 1       .       G       <NON_REF>       .       .       END=201709      GT:DP:GQ:MIN_DP:PL
      0/0:0:0:0:0,0,0
    chr14_GL000225v1_random 1       .       G       <NON_REF>       .       .       END=211173      GT:DP:GQ:MIN_DP:PL
      0/0:0:0:0:0,0,0
    chr14_KI270722v1_random 1       .       G       <NON_REF>       .       .       END=194050      GT:DP:GQ:MIN_DP:PL

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Laura Rey Vargas

    This result is quite unexpected. It looks like your GVCFs only contain a single confidence interval that encompasses the whole chromosomes with HOM_REF therefore you don't get anything out of your genotyping. Are you sure your bam files are not empty?

    How did you generate these GVCF files? Can you share your former steps with a little more detail?

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Laura Rey Vargas

    I did not check the bam files, but I will now. 

    To generate the GVCF files i followed these steps:

    1. Alignment with STAR:

    STAR_options="--runThreadN 20 \
    --limitGenomeGenerateRAM 100000000000 \
    --genomeDir /home/wovalle3/Samples/GRCh38_Gencode31/Genomeindex \
    --outSAMunmapped Within \
    --outFilterType BySJout \
    --outFilterMultimapNmax 20 \
    ismatchNmax 999 \
    --outFilterMismatchNoverLmax 0.04 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --alignSJoverhangMin 8 \
    --limitSjdbInsertNsj 2500000 \
    --alignSJDBoverhangMin 1 \
    --sjdbScore 1 \
    --outFilterMatchNminOverLread 0.66 \
    --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM GeneCounts \
    --peOverlapNbasesMin 10 --alignEndsProtrude 10 ConcordantPair"

    for sample in "${samples[@]}"; do
        R1="${sample}_nonrrna_R1.fastq.gz"
        R2="${sample}_nonrrna_R2.fastq.gz"

    STAR $STAR_options --readFilesCommand zcat --readFilesIn "$input_dir/$R1" "$input_dir/$R2" --outFileNamePrefix "${sample}_" 1> "${sample}_output.log"

    echo "STAR completado para la muestra $sample"
    done

    2. AddOrReplaceReadGroups to mark reads 

    for bam_file in "${bam_dir}"/*.bam; do
        base_name=$(basename "$bam_file" .bam)
        
      java -jar "${picard}" AddOrReplaceReadGroups \
            I="$bam_file" \
            O="${output_dir}/${base_name}-wRG.bam" \
            SO=coordinate \
            CREATE_INDEX=true \
            RGID=K00171 \
            RGLB=SureSelectXT \
            RGPL=illumina \
            RGPU=K00171 \
            RGSM="${base_name}"
    done
        echo "Archivo ${base_name}.bam ha sido procesado."
    done
    echo "La marca de grupos de lecturas y la creación de índices han sido completadas."

    3. Mark duplicates 

    for bam_file in "${bam_dir}"/*-wRG.bam; do
        base_name=$(basename "$bam_file" -wRG.bam)
        salida="${base_name}-noDups.bam"
        
    ### Marcar duplicados
    "$java" -jar "$picard" MarkDuplicates \
             I="$bam_file" \
             O="$output_dir/$salida" \
             M="$output_dir/${salida%.bam}.metrics" \
             REMOVE_DUPLICATES=true
    done

    4. AddOrReplaceReadGroups for non duplicates bam files

    for bam_file in "${bam_dir}"/*-noDups.bam; do
        base_name=$(basename "$bam_file" -noDups.bam)
     salida="${base_name}-sorted2.bam"

         "$java" -jar "$picard" AddOrReplaceReadGroups \
                I="$bam_file" \
                O="$output_dir/$salida" \
                SO=coordinate \
                CREATE_INDEX=true \
                RGID=K00171 \
                RGLB=SureSelectXT \
                RGPL=illumina \
                RGPU=K00171 \
                RGSM="${base_name}"
    done

    5. Base recalibration

    for bam_file in "$output_dir"/*-sorted2.bam; do
        base_name=$(basename "$bam_file" "-sorted2.bam")
        output_table="${base_name}-recal_data.table"

        # Ejecutar BaseRecalibrator
        gatk BaseRecalibrator \
            -R "$reference" \
            -I "$bam_file" \
            --known-sites "$dbsnp" \
            -O "$output_table" \
            --use-original-qualities
    done

    6. ApplyBQSR for recalibraiton of BAM files

    for bam_file in "$output_dir"/*-sorted2.bam; do
        base_name=$(basename "$bam_file" "-sorted2.bam")
        recal_table="${output_dir}/${base_name}-recal_data.table"
        
        # Ejecutar ApplyBQSR
        gatk ApplyBQSR \
            -I "$bam_file" \
            -R "$reference" \
            --bqsr-recal-file "$recal_table" \
            -O "${output_dir}/${base_name}-BQSR.bam"     
    done

    6. HaplotypeCaller to obtain the VCF files 

    for bam_file in "$output_dir"/*-BQSR.bam; do
      sample_name=$(basename "$bam_file" -BQSR.bam)

      gatk --java-options "-Xmx35g" HaplotypeCaller \
        -R "$reference" \
        -I "$bam_file" \
        -O "$output_dir/gvcfs/${sample_name}.g.vcf.gz" \
        -ERC GVCF \
        --dbsnp "$dbsnp_file" \
        --native-pair-hmm-threads "$num_threads" 
    done

    After obtaining these first VCF files, I did the GenomicsDBImport and then the GenotypeGVCFs steps as I described later, and started to get errors, I guess because of the empty files. 

    Sorry for the long response, hopefully, you can guide me with what I am doing wrong.

    Thanks for the help

     

     

    0
    Comment actions Permalink
  • Avatar
    Laura Rey Vargas

    i checked the bam files and they do have information. I obtained read counts from them and also ran various quality metrics and they seem good. Maybe it is something else

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Bingo! These are STAR aligned transcriptome bam files which marks unique mapping reads as with MAPQ 255. 255 score is actually no different than MAPQ 0 for GATK tools also HaplotypeCaller does not like reads with splice tags ("N") within CIGARs therefore you need to readjust your mapping quality scores using SplitNCigarReads tool before you can start calling variants using HaplotypeCaller. 

    We have a small article on how to call short variants from RNASeq data in the link below

    https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels 

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Laura Rey Vargas

    Thank you so much for the help. I will implement this new step as you suggested.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk