Difference between GATK3.3 and GATK4.0
I used GATK4.0 HaplotypeCaller GVCF model to call WES SNV and InDel, then I found two variants in a family which can be seen in the IGV but missed in HaplotypeCaller step.Below is my command.
1.BWA
bwa mem -t 10 -K 10000000 ucsc.hg19.fasta Sample.R1.clean.fastq.gz Sample.R2.clean.fastq.gz > Sample.sam
2.Sort
java -jar picard.jar SortSam I=Sample.sam O=Sample.sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
3.ARRG
java -jar picard.jar AddOrReplaceReadGroups I=Sample.sorted.bam O=Sample.ARRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPU=bar VALIDATION_STRINGENCY=LENIENT RGPL=illumina RGSM=DePristo CREATE_INDEX=True
4.MarkDup
java -jar picard.jar MarkDuplicates I=Sample.ARRG.bam O=Sample.dedupped.bam M=Sample.dedupped.metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT
5.BQSR
gatk BaseRecalibrator -I Sample.dedupped.bam -R ucsc.hg19.fasta --known-sites dbsnp_138.hg19.vcf --known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf --known-sites 1000G_phase1.indels.hg19.sites.vcf -O Sample.recal.table
6.ApplyBQSR
gatk ApplyBQSR -I Sample.dedupped.bam -R /data/hm/Genome/ucsc.hg19.fasta --bqsr-recal-file Sample.recal.table -O Sample.BQSR.bam
7.HaplotypeCaller
gatk --java-options '-Xmx4g' HaplotypeCaller -R ucsc.hg19.fasta -L genes.interval_list -I Sample.BQSR.bam -O Sample.g.vcf.gz -ERC GVCF
8.GenotypeGVCFs
gatk --java-options '-Xmx4g' GenotypeGVCFs -R ucsc.hg19.fasta -V Sample.g.vcf.gz -O Sample.vcf
I can't see the two variants in Sample.vcf. So I use the GATK3.3 UnifiedGenotyper model to call variants again.Below is my command.
java -jar GATK3.3-0/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T UnifiedGenotyper -nt 10 -I Sample.BQSR.bam -o Sample.vcf -stand_call_conf 30.0 -stand_emit_conf 0 -glm BOTH -rf BadCigar -L genes.interval_list
This time I can see both of them, below is the variant records by GATK3.3 UnifiedGenotyper model,
chr13 113771911 . G GC 2451.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.630;DP=170;FS=5.640;MLEAC=1;MLEAF=0.500;MQ=58.69;MQ0=0;MQRankSum=1.344;QD=14.42;ReadPosRankSum=-0.249;SOR=1.149 GT:AD:DP:GQ:PL 0/1:83,83:167:99:2489,0,3179
chr13 113771912 . T G 1496.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.153;DP=161;Dels=0.00;FS=5.876;HaplotypeScore=20.2006;MLEAC=1;MLEAF=0.500;MQ=58.61;MQ0=2;MQRankSum=1.827;QD=9.30;ReadPosRankSum=0.982;SOR=1.129 GT:AD:DP:GQ:PL 0/1:83,72:161:99:1525,0,1716
chr13 113771962 . T C 3093.77 . AC=2;AF=1.00;AN=2;DP=115;Dels=0.01;FS=0.000;HaplotypeScore=22.6914;MLEAC=2;MLEAF=1.00;MQ=55.94;MQ0=4;QD=26.90;SOR=1.329 GT:AD:DP:GQ:PL 1/1:0,113:114:99:3122,235,0
I check the variants by IGV, as below:
I think the two variants are very obvious.I want to know why UnifiedGenotyper model can find them but HaplotypeCaller can't.
Thank you in advance.
-
Hi Daisy thanks so much for your question!
When you say GATK4.0 are you specifically referring to GATK 4.0.0? If so, our first recommendation is that you update to the latest version, GATK 4.1.4.1
However, there is a good chance that you will see a similar situation, as it is likely caused by some issue with LocalAssembly. We are currently working on some changes to HaplotypeCaller that might improve the assembly of these sites, and you can try some of our prototype code by running HaplotypeCaller with the --linked-de-bruijn-graph argument to see if it fixes the issue. If you try this, please let us know what you observe!
You can also try running with the argument --debug-graph-transformations on while running on that subset of the genome where those variants were, so if you are still missing those variants, you can provide us with the dot files this argument outputs so we can take a closer look at why those variants are lost.
Please sign in to leave a comment.
1 comment