HaplotypeCaller for haploid genome
Hi, any help is appreciated. I am using HaplotypeCaller on a haploid organism. For now I am running it on a single sample, but because I have multiple samples I will need to follow it with combining and joint-genotyping the GVCFs. The version is GATK 4.1.4.0. Command line:
gatk HaplotypeCaller -I P1.bam -R ref.fasta -ERC GVCF -ploidy 1 -O output.g.vcf
I am attaching a screenshot of the error log. My output.g.vcf's last column is "20" instead of the sample name. Why is that? Also, why do I have <NON_REF> under the column ALT? First few lines of output file are:
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --sample-ploidy 1 --emit-ref-confidence GVCF --output 15St008.g.vcf --input 15St008markdup_tag.bam --reference ../genotyping_progeny/Et28Aref.fasta --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --num-reference-samples-if-no-call 0 --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing 
false --use-filtered-reads-for-annotations false --correct-overlapping-quality false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --dont-trim-active-regions false --max-extension 25 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --max-unpruned-variants 100 --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --max-mnp-distance 0 --force-call-filtered-alleles false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true 
--add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.1.4.0",Date="September 5, 2020 9:36:43 PM CDT">
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive)
##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive)
##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive)
##GVCFBlock12-13=minGQ=12(inclusive),maxGQ=13(exclusive)
##GVCFBlock13-14=minGQ=13(inclusive),maxGQ=14(exclusive)
##GVCFBlock14-15=minGQ=14(inclusive),maxGQ=15(exclusive)
##GVCFBlock15-16=minGQ=15(inclusive),maxGQ=16(exclusive)
##GVCFBlock16-17=minGQ=16(inclusive),maxGQ=17(exclusive)
##GVCFBlock17-18=minGQ=17(inclusive),maxGQ=18(exclusive)
##GVCFBlock18-19=minGQ=18(inclusive),maxGQ=19(exclusive)
##GVCFBlock19-20=minGQ=19(inclusive),maxGQ=20(exclusive)
##GVCFBlock2-3=minGQ=2(inclusive),maxGQ=3(exclusive)
##GVCFBlock20-21=minGQ=20(inclusive),maxGQ=21(exclusive)
##GVCFBlock21-22=minGQ=21(inclusive),maxGQ=22(exclusive)
##GVCFBlock22-23=minGQ=22(inclusive),maxGQ=23(exclusive)
##GVCFBlock23-24=minGQ=23(inclusive),maxGQ=24(exclusive)
##GVCFBlock24-25=minGQ=24(inclusive),maxGQ=25(exclusive)
##GVCFBlock25-26=minGQ=25(inclusive),maxGQ=26(exclusive)
##GVCFBlock26-27=minGQ=26(inclusive),maxGQ=27(exclusive)
##GVCFBlock27-28=minGQ=27(inclusive),maxGQ=28(exclusive)
##GVCFBlock28-29=minGQ=28(inclusive),maxGQ=29(exclusive)
##GVCFBlock29-30=minGQ=29(inclusive),maxGQ=30(exclusive)
##GVCFBlock3-4=minGQ=3(inclusive),maxGQ=4(exclusive)
##GVCFBlock30-31=minGQ=30(inclusive),maxGQ=31(exclusive)
##GVCFBlock31-32=minGQ=31(inclusive),maxGQ=32(exclusive)
##GVCFBlock32-33=minGQ=32(inclusive),maxGQ=33(exclusive)
##GVCFBlock33-34=minGQ=33(inclusive),maxGQ=34(exclusive)
##GVCFBlock34-35=minGQ=34(inclusive),maxGQ=35(exclusive)
##GVCFBlock35-36=minGQ=35(inclusive),maxGQ=36(exclusive)
##GVCFBlock36-37=minGQ=36(inclusive),maxGQ=37(exclusive)
##GVCFBlock37-38=minGQ=37(inclusive),maxGQ=38(exclusive)
##GVCFBlock38-39=minGQ=38(inclusive),maxGQ=39(exclusive)
##GVCFBlock39-40=minGQ=39(inclusive),maxGQ=40(exclusive)
##GVCFBlock4-5=minGQ=4(inclusive),maxGQ=5(exclusive)
##GVCFBlock40-41=minGQ=40(inclusive),maxGQ=41(exclusive)
##GVCFBlock41-42=minGQ=41(inclusive),maxGQ=42(exclusive)
##GVCFBlock42-43=minGQ=42(inclusive),maxGQ=43(exclusive)
##GVCFBlock43-44=minGQ=43(inclusive),maxGQ=44(exclusive)
##GVCFBlock44-45=minGQ=44(inclusive),maxGQ=45(exclusive)
##GVCFBlock45-46=minGQ=45(inclusive),maxGQ=46(exclusive)
##GVCFBlock46-47=minGQ=46(inclusive),maxGQ=47(exclusive)
##GVCFBlock47-48=minGQ=47(inclusive),maxGQ=48(exclusive)
##GVCFBlock48-49=minGQ=48(inclusive),maxGQ=49(exclusive)
##GVCFBlock49-50=minGQ=49(inclusive),maxGQ=50(exclusive)
##GVCFBlock5-6=minGQ=5(inclusive),maxGQ=6(exclusive)
##GVCFBlock50-51=minGQ=50(inclusive),maxGQ=51(exclusive)
##GVCFBlock51-52=minGQ=51(inclusive),maxGQ=52(exclusive)
##GVCFBlock52-53=minGQ=52(inclusive),maxGQ=53(exclusive)
##GVCFBlock53-54=minGQ=53(inclusive),maxGQ=54(exclusive)
##GVCFBlock54-55=minGQ=54(inclusive),maxGQ=55(exclusive)
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock6-7=minGQ=6(inclusive),maxGQ=7(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock7-8=minGQ=7(inclusive),maxGQ=8(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock8-9=minGQ=8(inclusive),maxGQ=9(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock9-10=minGQ=9(inclusive),maxGQ=10(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=CP054636.1,length=2015427>
##contig=<ID=CP054637.1,length=1782973>
##contig=<ID=CP054638.1,length=1666431>
##contig=<ID=CP054639.1,length=1660169>
##contig=<ID=CP054640.1,length=1385457>
##contig=<ID=CP054641.1,length=1354400>
##contig=<ID=CP054642.1,length=1267693>
##contig=<ID=CP054643.1,length=1179031>
##contig=<ID=CP054644.1,length=1115805>
##contig=<ID=CP054645.1,length=923001>
##contig=<ID=CP054627.1,length=3566547>
##contig=<ID=CP054646.1,length=878248>
##contig=<ID=CP054647.1,length=826205>
##contig=<ID=CP054648.1,length=703354>
##contig=<ID=CP054649.1,length=678853>
##contig=<ID=CP054650.1,length=656216>
##contig=<ID=CP054651.1,length=389826>
##contig=<ID=CP054652.1,length=293105>
##contig=<ID=CP054653.1,length=115922>
##contig=<ID=CP054654.1,length=68497>
##contig=<ID=CP054655.1,length=58580>
##contig=<ID=CP054628.1,length=3390549>
##contig=<ID=CP054656.1,length=55715>
##contig=<ID=CP054629.1,length=3064236>
##contig=<ID=CP054630.1,length=3003811>
##contig=<ID=CP054631.1,length=2346863>
##contig=<ID=CP054632.1,length=2346177>
##contig=<ID=CP054633.1,length=2278376>
##contig=<ID=CP054634.1,length=2274269>
##contig=<ID=CP054635.1,length=2134525>
##source=HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 20
CP054636.1 1 . G <NON_REF> . . END=15142 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0
CP054636.1 15143 . A <NON_REF> . . END=15188 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
CP054636.1 15189 . T <NON_REF> . . END=15189 GT:DP:GQ:MIN_DP:PL 0:2:72:2:0,72
CP054636.1 15190 . T <NON_REF> . . END=15200 GT:DP:GQ:MIN_DP:PL 0:2:84:2:0,84
CP054636.1 15201 . A <NON_REF> . . END=15201 GT:DP:GQ:MIN_DP:PL 0:2:72:2:0,72
CP054636.1 15202 . T <NON_REF> . . END=15206 GT:DP:GQ:MIN_DP:PL 0:2:84:2:0,84
CP054636.1 15207 . G <NON_REF> . . END=15292 GT:DP:GQ:MIN_DP:PL 0:5:99:3:0,99
CP054636.1 15293 . G <NON_REF> . . END=15293 GT:DP:GQ:MIN_DP:PL 0:4:84:4:0,84
CP054636.1 15294 . A <NON_REF> . . END=15410 GT:DP:GQ:MIN_DP:PL 0:5:99:3:0,99
CP054636.1 15411 . T <NON_REF> . . END=15411 GT:DP:GQ:MIN_DP:PL 0:3:87:3:0,87
CP054636.1 15412 . G <NON_REF> . . END=15412 GT:DP:GQ:MIN_DP:PL 0:3:99:3:0,113
CP054636.1 15413 . A <NON_REF> . . END=15413 GT:DP:GQ:MIN_DP:PL 0:3:68:3:0,68
CP054636.1 15414 . A <NON_REF> . . END=15423 GT:DP:GQ:MIN_DP:PL 0:3:99:3:0,99
CP054636.1 15424 . T <NON_REF> . . END=15424 GT:DP:GQ:MIN_DP:PL 0:3:68:3:0,68
CP054636.1 15425 . A <NON_REF> . . END=15425 GT:DP:GQ:MIN_DP:PL 0:3:42:3:0,42
CP054636.1 15426 . C <NON_REF> . . END=15428 GT:DP:GQ:MIN_DP:PL 0:3:99:3:0,113
CP054636.1 15429 . A <NON_REF> . . END=15429 GT:DP:GQ:MIN_DP:PL 0:3:73:3:0,73
CP054636.1 15430 . T <NON_REF> . . END=15433 GT:DP:GQ:MIN_DP:PL 0:3:99:3:0,99
CP054636.1 15434 . A <NON_REF> . . END=15434 GT:DP:GQ:MIN_DP:PL 0:3:68:3:0,68
CP054636.1 15435 . A <NON_REF> . . END=15435 GT:DP:GQ:MIN_DP:PL 0:3:87:3:0,87
CP054636.1 15436 . C <NON_REF> . . END=15443 GT:DP:GQ:MIN_DP:PL 0:3:45:3:0,45
CP054636.1 15444 . G <NON_REF> . . END=15463 GT:DP:GQ:MIN_DP:PL 0:1:0:0:0,0
CP054636.1 15464 . C <NON_REF> . . END=15500 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
CP054636.1 15501 . A <NON_REF> . . END=15501 GT:DP:GQ:MIN_DP:PL 0:1:15:1:0,15
CP054636.1 15502 . A <NON_REF> . . END=15504 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
CP054636.1 15505 . T <NON_REF> . . END=15505 GT:DP:GQ:MIN_DP:PL 0:1:15:1:0,15
CP054636.1 15506 . C <NON_REF> . . END=15507 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
CP054636.1 15508 . C <NON_REF> . . END=15508 GT:DP:GQ:MIN_DP:PL 0:1:30:1:0,30
CP054636.1 15509 . T <NON_REF> . . END=15509 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
CP054636.1 15510 . T <NON_REF> . . END=15511 GT:DP:GQ:MIN_DP:PL 0:1:30:1:0,30
CP054636.1 15512 . C <NON_REF> . . END=15548 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
CP054636.1 15549 . A <NON_REF> . . END=15549 GT:DP:GQ:MIN_DP:PL 0:2:72:2:0,72
CP054636.1 15550 . A <NON_REF> . . END=15550 GT:DP:GQ:MIN_DP:PL 0:2:84:2:0,84
CP054636.1 15551 . G <NON_REF> . . END=15551 GT:DP:GQ:MIN_DP:PL 0:2:57:2:0,57
CP054636.1 15552 . C <NON_REF> . . END=15552 GT:DP:GQ:MIN_DP:PL 0:2:84:2:0,84
CP054636.1 15553 . G <NON_REF> . . END=15553 GT:DP:GQ:MIN_DP:PL 0:2:57:2:0,57
CP054636.1 15554 . A <NON_REF> . . END=15561 GT:DP:GQ:MIN_DP:PL 0:2:84:2:0,84
CP054636.1 15562 . C <NON_REF> . . END=15562 GT:DP:GQ:MIN_DP:PL 0:2:57:2:0,57
CP054636.1 15563 . G <NON_REF> . . END=15563 GT:DP:GQ:MIN_DP:PL 0:2:84:2:0,84
CP054636.1 15564 . C <NON_REF> . . END=15564 GT:DP:GQ:MIN_DP:PL 0:2:57:2:0,57
-
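Hello Pummi Singh, <NON_REF> is expected in GVCFs. As the ##ALT line in your header says, it represents any possible alternative allele at that location. You can find more information here: GVCF - Genomic Variant Call Format.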
I do not see any issues in your error log but it also does not look like the complete stack trace.
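To get a feel for what those <NON_REF> records cover, here is a rough sketch (not an official GATK utility) that prints the start, end, length and GQ of each reference block. It assumes an uncompressed g.vcf where the block's INFO column is just END=<pos> and the FORMAT is GT:DP:GQ:MIN_DP:PL, as in the lines you posted. `gvcf_blocks` is a made-up helper name.

```shell
# Summarise gVCF reference blocks: start, end, length and GQ of each <NON_REF>-only record.
# Assumption: INFO is exactly "END=<pos>" and FORMAT is GT:DP:GQ:MIN_DP:PL for these records.
gvcf_blocks() {
  awk '
    /^#/ { next }                          # skip header lines
    $5 == "<NON_REF>" && $8 ~ /^END=/ {    # reference block: the only ALT is <NON_REF>
      stop = substr($8, 5) + 0             # strip the "END=" prefix
      split($10, s, ":")                   # sample column, GQ is the third subfield
      print $2, stop, stop - $2 + 1, s[3]
    }
  ' "$@"
}
# Usage: gvcf_blocks output.g.vcf | head
```

For the first record you posted (POS 1, END=15142, GQ 0) this prints `1 15142 15142 0`, i.e. one block of 15142 bases with no supporting reads.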
-
Thank you Genevieve. Why is the header of the last column of my output.g.vcf "20" and not the sample name? This happens every time I run a different sample, and because of it I am unable to use GenomicsDBImport (duplicate column error).
-
Hi Pummi Singh, please post your entire stack trace to look for errors.
I would also recommend running ValidateSamFile on your input BAM (P1.bam) to check for issues with the file. You may also want to read this document about read groups because incorrect read groups will lead to issues.
-
Hi Genevieve,
This comment box does not allow me to post the entire stack trace and I cannot upload anything but a jpg file.
Thanks for suggesting ValidateSamFile. I tried it and got the following errors:
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE 135117
ERROR:MATE_NOT_FOUND 1278150
My BAM file is indexed, sorted and fixmated, with duplicates removed and reads tagged. I am not sure why I get these errors or whether I can just ignore them. Please advise. Thanks
-
The tutorial I linked to above has some more information on these errors. They usually need to be fixed before use with GATK tools.
I will need to see the entire HaplotypeCaller error log to determine if a problem exists there. You can also search the stack trace for errors or warnings to determine if any caused the sample ID to appear as "20".
Did you check your read groups? Is the SM "20" in those?
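One way to check this, sketched under the assumption that samtools is installed: pipe the BAM header into a small filter that lists the distinct SM values. HaplotypeCaller takes the sample name from the SM tag, so whatever this prints is what ends up as the last column header. `rg_samples` is only an illustrative helper name.

```shell
# List the distinct SM (sample) values found in the @RG lines of a SAM header read from stdin.
rg_samples() {
  awk -F'\t' '
    /^@RG/ {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SM:/) print substr($i, 4)   # drop the "SM:" prefix
    }
  ' | sort -u
}
# Usage (assumes samtools is on the PATH):
#   samtools view -H P1.bam | rg_samples
```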
-
Yes, the SM was indeed "20" in the read groups and I have now fixed this issue. I can see sample names finally. Thanks a ton.
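The thread does not say how the SM tag was fixed. One common option is Picard's AddOrReplaceReadGroups (bundled with GATK4), which assigns all reads in a BAM to a single new read group. The sketch below only prints the commands so they can be reviewed first. It assumes BAMs named <sample>.bam; the RGLB, RGPL and RGPU values and P2.bam are placeholders, and `print_rg_fix_commands` is a hypothetical helper.

```shell
# Dry run: print one AddOrReplaceReadGroups command per BAM so each sample gets its own SM.
# Assumptions: BAMs are named <sample>.bam; lib1, ILLUMINA and unit1 are placeholder values.
print_rg_fix_commands() {
  for bam in "$@"; do
    sample=$(basename "$bam" .bam)   # sample name taken from the file name
    printf 'gatk AddOrReplaceReadGroups -I %s -O %s.rg.bam --RGID %s --RGLB lib1 --RGPL ILLUMINA --RGPU unit1 --RGSM %s\n' \
      "$bam" "$sample" "$sample" "$sample"
  done
}
print_rg_fix_commands P1.bam P2.bam
```

After running the printed commands and indexing the new BAMs, HaplotypeCaller has to be rerun so that each g.vcf carries the new sample name.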
-
Glad you fixed it! Thanks for updating with your solution.
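Before importing several g.vcf files with GenomicsDBImport, it may be worth confirming that every file carries a distinct sample name, since identical names lead to the duplicate error described above. A minimal sketch, assuming uncompressed, tab-delimited g.vcf files; both function names are hypothetical.

```shell
# Print "<sample><TAB><file>" for every sample column (column 10 onward) of each gVCF.
gvcf_sample_names() {
  for f in "$@"; do
    awk -F'\t' -v file="$f" '
      /^#CHROM/ { for (i = 10; i <= NF; i++) print $i "\t" file; exit }
    ' "$f"
  done
}

# Print sample names that appear in more than one of the given files.
duplicate_samples() {
  gvcf_sample_names "$@" | cut -f1 | sort | uniq -d
}
# Usage: duplicate_samples *.g.vcf   (no output means all sample names are unique)
```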