ApplyBQSRSpark and ApplyBQSR produce different results
I am seeing different results when I use ApplyBQSR vs ApplyBQSRSpark with exactly the same inputs. I tried both GATK 4.1.0.0 and 4.1.9.0.
I am comparing the command:
gatk ApplyBQSR --java-options '-Xmx28g' --reference human_g1k_v37_decoy.fasta --create-output-bam-index true --bqsr-recal-file sample.recal.table --input sample.md.bam --output sample.bam
to the command:
gatk ApplyBQSRSpark --conf 'spark.executor.cores = 4' --java-options '-Xmx28g' --reference human_g1k_v37_decoy.fasta --create-output-bam-index true --bqsr-recal-file sample.recal.table --input sample.md.bam --output sample.bam
The two commands are identical except for the `Spark` suffix on the tool name and the added `--conf 'spark.executor.cores = 4'`. The input BAM and recalibration file are the same. I assessed the result by checking the md5sum of each output excluding the header (`samtools view sample.bam | md5sum`), and found a different checksum between the Spark and non-Spark outputs for 14 of the 15 samples we tried. I also tried this on two versions of GATK for good measure.
We were hoping that the Spark version of each tool would produce the same BAM as its non-Spark counterpart. What is causing the difference here? Is there any way to force them to be the same?
-
AMN, do you have any more information about the extent of these differences?
This has been previously discussed on the forum, though we didn't have a resolution: https://gatk.broadinstitute.org/hc/en-us/community/posts/360073320632-BQSR-Spark-Why-Beta-
-
Hi Genevieve,
We are still looking into the differences. However, a preliminary look at the first few lines already shows some. Here are the 42nd and 43rd lines of each BAM:
spark-enabled:
$ samtools view ../spark/sample.bam | head -43 | tail -2
<instrumentID>:292:<flowcellID>:4:1265:20808:18458 133 1 10022 0 * = 10022 0 TAAAAGAATAGTAATAACCATACATTTAAACATACACTAAAACAAACTGTTACTCAAATATTTAAAATTCACTTAAGTATTACTGTAATGATTAAGTAAATTCAAAACAAAATGAATAAGTTTAATAACATCTAAACAGAGAATAATAAA <?+D,H'*,BH<BF<&:+-7,&;(,G>%*:G'E%G'*HB):F*7F:;HIC-;*B>E):E%EG-6:F:+>-7GH-6:I,6+>B*BIC6:EI9E=%F?CB9F;GH'9F9G6FF)E/EFEBF?;--5F+B9G6+=HB9)*6IE08)+B)<6*E MC:Z:87M1I39M PG:Z:MarkDuplicates RG:Z:<instrumentID>@292@<flowcellID>@4 AS:i:0 XS:i:0
<instrumentID>:292:<flowcellID>:4:1265:20808:18458 1097 1 10022 15 87M1I39M = 10022 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCAAACCATAACCCTAACCCTAACCCTAACCCAAACACTAACCCAAACCCTAAACCAAACCCTAAACCTCAC CDEGBEFF/HBF=FFHBF.FFHBF=FFDBF,=FDB*+-F(FFGF-CBFG-FHBFGFF(FF<F(EBFGF<37F<-FCBFGF<HB;+F-(*FG(<HBFGF-(FF+F-H7F;GF(FF+F,2BF*+FC-E; XA:Z:1,+10070,39M1I38M1D28M1I20M,11;hs37d5,+10061289,39M1I51M36S,4;4,-191043982,38S67M1D22M,4;3,-197900264,43S21M1I62M,3;hs37d5,-10060120,44S41M1I41M,3; MD:Z:39T17T4C27C7T8C2T8C3A2 PG:Z:MarkDuplicates RG:Z:<instrumentID>@292@<flowcellID>@4 NM:i:10 AS:i:77 XS:i:68
spark-disabled:
$ samtools view sample.bam | head -43 | tail -2
<instrumentID>:292:<flowcellID>:4:1265:20808:18458 1097 1 10022 15 87M1I39M = 10022 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCAAACCATAACCCTAACCCTAACCCTAACCCAAACACTAACCCAAACCCTAAACCAAACCCTAAACCTCAC CDEGBEFF/HBF=FFHBF.FFHBF=FFDBF,=FDB*+-F(FFGF-CBFG-FHBFGFF(FF<F(EBFGF<37F<-FCBFGF<HB;+F-(*FG(<HBFGF-(FF+F-H7F;GF(FF+F,2BF*+FC-E; XA:Z:1,+10070,39M1I38M1D28M1I20M,11;hs37d5,+10061289,39M1I51M36S,4;4,-191043982,38S67M1D22M,4;3,-197900264,43S21M1I62M,3;hs37d5,-10060120,44S41M1I41M,3; MD:Z:39T17T4C27C7T8C2T8C3A2 PG:Z:MarkDuplicates RG:Z:<instrumentID>@292@<flowcellID>@4 NM:i:10 AS:i:77 XS:i:68
<instrumentID>:292:<flowcellID>:4:1265:20808:18458 133 1 10022 0 * = 10022 0 TAAAAGAATAGTAATAACCATACATTTAAACATACACTAAAACAAACTGTTACTCAAATATTTAAAATTCACTTAAGTATTACTGTAATGATTAAGTAAATTCAAAACAAAATGAATAAGTTTAATAACATCTAAACAGAGAATAATAAA <?+D,H'*,BH<BF<&:+-7,&;(,G>%*:G'E%G'*HB):F*7F:;HIC-;*B>E):E%EG-6:F:+>-7GH-6:I,6+>B*BIC6:EI9E=%F?CB9F;GH'9F9G6FF)E/EFEBF?;--5F+B9G6+=HB9)*6IE08)+B)<6*E MC:Z:87M1I39M PG:Z:MarkDuplicates RG:Z:<instrumentID>@292@<flowcellID>@4 AS:i:0 XS:i:0
It seems that they are sorted differently. However, I ran `samtools sort -n sample.bam | samtools view | md5sum` on each BAM and still got different md5sums, so I am not confident that sort order is the only difference.
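One way to test whether record order is the only difference is to compare the records in a way that ignores order entirely. The following is a minimal shell sketch, not a GATK or samtools feature: `order_independent_md5` and `differing_records` are hypothetical helper names, and the sketch assumes `md5sum`, `sort`, `comm`, and `mktemp` are available. A byte-wise sort of the full record lines removes any dependence on record order, including ties between records that share a read name.

```shell
# Checksum of SAM records read from stdin, independent of record order.
# LC_ALL=C forces a byte-wise sort so the result does not vary with locale.
order_independent_md5() {
    LC_ALL=C sort | md5sum | cut -d ' ' -f 1
}

# Print records that occur in only one of two plain-text SAM record files.
# Lines unique to the first file are unindented; lines unique to the
# second file are prefixed with a tab (standard `comm -3` output).
differing_records() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    LC_ALL=C sort "$1" > "$tmp1"
    LC_ALL=C sort "$2" > "$tmp2"
    LC_ALL=C comm -3 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}

# Usage, with the file paths from the posts above:
#   samtools view sample.bam          | order_independent_md5
#   samtools view ../spark/sample.bam | order_independent_md5
```

If the two checksums match, the outputs contain the same records in a different order. If they differ, writing each `samtools view` output to a text file and passing both files to `differing_records` lists the records that are not byte-identical, which should show which fields (for example base qualities or tags) changed.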
-
Hi AMN,
Please keep us up to date if you find any differences besides the sort order.
We did find one difference between ApplyBQSR and ApplyBQSRSpark: ApplyBQSRSpark performs an extra sort. Were both the inputs sorted with the same tool?
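To help answer the sorting question, the declared sort order of each file can be read from the `SO` field of the `@HD` header line. This is a small sketch under that assumption: `header_sort_order` is a hypothetical helper name, and the header only records what the writing tool declared, not how ties between records were ordered.

```shell
# Read a SAM header on stdin and print the SO (sort order) value
# from the @HD line, if one is present.
header_sort_order() {
    awk -F '\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SO:/) print substr($i, 4)
    }'
}

# Usage, with the file names from the commands above:
#   samtools view -H sample.md.bam | header_sort_order
#   samtools view -H sample.bam    | header_sort_order
```

The `@PG` lines of the same header list the programs that have processed the file, so `samtools view -H sample.md.bam | grep '^@PG'` may also show which tool did the sorting.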
Let us know what you find.
Best,
Genevieve