@@ -56,39 +56,48 @@ julia> sum(big.(x))
# But the standard summation algorithm computes this sum very inaccurately
# (not even the sign is correct)
julia> sum(x)
- -136.0
+ -8.0


# Compensated summation algorithms should compute this more accurately
julia> using AccurateArithmetic

# Algorithm by Ogita, Rump and Oishi
julia> sum_oro(x)
- 0.9999999999999716
+ 1.0000000000000084

# Algorithm by Kahan, Babuska and Neumaier
julia> sum_kbn(x)
- 0.9999999999999716
+ 1.0000000000000084
```


- ![](test/figs/qual.svg)
+ ![](test/figs/sum_accuracy.svg)
+ ![](test/figs/dot_accuracy.svg)

- In the graph above, we see the relative error vary as a function of the
+ In the graphs above, we see the relative error vary as a function of the
condition number, in a log-log scale. Errors lower than ϵ are arbitrarily set to
ϵ; conversely, when the relative error is more than 100% (i.e. no digit is
correctly computed anymore), the error is capped there in order to avoid
- affecting the scale of the graph too much. What we see is that the pairwise
+ affecting the scale of the graph too much. What we see on the left is that the pairwise
summation algorithm (as implemented in Base.sum) starts losing accuracy as soon
as the condition number increases, computing only noise when the condition
- number exceeds 1/ϵ≃10¹⁶. In contrast, both compensated algorithms
+ number exceeds 1/ϵ≃10¹⁶. The same goes for the naive summation algorithm.
+ In contrast, both compensated algorithms
(Kahan-Babuska-Neumaier and Ogita-Rump-Oishi) still accurately compute the
result at this point, and start losing accuracy there, computing meaningless
results when the condition number reaches 1/ϵ²≃10³². In effect, these (simply)
compensated algorithms produce the same results as if a naive summation had been
performed with twice the working precision (128 bits in this case), and then
rounded to 64-bit floats.

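As an illustration of how a compensated algorithm recovers the bits lost by each addition, here is a minimal, unvectorized sketch of the Kahan-Babuska-Neumaier scheme (the package's `sum_kbn` is a vectorized, unrolled implementation of the same idea; the name `kbn_sum` below is purely illustrative):

```julia
# Minimal sketch of Kahan-Babuska-Neumaier compensated summation.
# `kbn_sum` is an illustrative name, not the package's implementation.
function kbn_sum(x)
    s = zero(eltype(x))   # high-order accumulator
    c = zero(eltype(x))   # compensation: rounding error lost by each addition
    for xi in x
        t = s + xi
        if abs(s) >= abs(xi)
            c += (s - t) + xi   # xi's low-order bits were lost in s + xi
        else
            c += (xi - t) + s   # s's low-order bits were lost in s + xi
        end
        s = t
    end
    s + c                 # apply the accumulated correction at the end
end
```

For instance, `kbn_sum([1.0, 1e100, 1.0, -1e100])` returns `2.0`, whereas a naive left-to-right sum of the same vector returns `0.0` (both `1.0` terms are absorbed by `1e100`).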
+ The same comments can be made for the dot product implementations shown on the
+ right. Uncompensated algorithms, as implemented in
+ `AccurateArithmetic.dot_naive` or `Base.dot` (which internally calls BLAS in
+ this case), exhibit typical loss of accuracy. In contrast, the implementation of
+ Ogita, Rump & Oishi's compensated algorithm effectively doubles the working
+ precision.
+
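The doubled working precision comes from error-free transformations: `fma` recovers the exact rounding error of each product, and a 2Sum step recovers the error of each addition. A scalar sketch in that spirit (the names `two_sum` and `comp_dot` are illustrative, not the package's API):

```julia
# Knuth's branch-free 2Sum: returns fl(a + b) and its exact rounding error.
function two_sum(a, b)
    s = a + b
    bv = s - a
    s, (a - (s - bv)) + (b - bv)
end

# Scalar sketch of a compensated (Ogita-Rump-Oishi-style) dot product.
function comp_dot(x, y)
    p = zero(eltype(x))   # high-order accumulator
    e = zero(eltype(x))   # accumulated low-order error terms
    for (xi, yi) in zip(x, y)
        h = xi * yi
        r = fma(xi, yi, -h)   # exact rounding error of the product
        p, q = two_sum(p, h)  # add the product, keep the addition error
        e += q + r
    end
    p + e
end
```

For instance, `comp_dot([1e100, 1.0, -1e100], ones(3))` returns `1.0`, where a naive dot product returns `0.0`.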
<br />

Performance-wise, compensated algorithms perform a lot better than alternatives
@@ -97,24 +106,31 @@ such as arbitrary precision (`BigFloat`) or rational arithmetic (`Rational`):
```julia
julia> using BenchmarkTools

+ julia> length(x)
+ 10001
+
julia> @btime sum($x)
- 1.305 μs (0 allocations: 0 bytes)
- -136.0
+ 1.320 μs (0 allocations: 0 bytes)
+ -8.0
+
+ julia> @btime sum_naive($x)
+ 1.026 μs (0 allocations: 0 bytes)
+ -1.121325337906356

julia> @btime sum_oro($x)
- 3.421 μs (0 allocations: 0 bytes)
- 0.9999999999999716
+ 3.348 μs (0 allocations: 0 bytes)
+ 1.0000000000000084

julia> @btime sum_kbn($x)
- 3.792 μs (0 allocations: 0 bytes)
- 0.9999999999999716
+ 3.870 μs (0 allocations: 0 bytes)
+ 1.0000000000000084

- julia> @btime sum(big.($x))
- 874.203 μs (20006 allocations: 1.14 MiB)
+ julia> @btime sum($(big.(x)))
+ 437.495 μs (2 allocations: 112 bytes)
1.0

- julia> @btime sum(Rational{BigInt}.(x))
- 22.702 ms (582591 allocations: 10.87 MiB)
+ julia> @btime sum($(Rational{BigInt}.(x)))
+ 10.894 ms (259917 allocations: 4.76 MiB)
1 // 1
```

@@ -124,32 +140,37 @@ than their naive floating-point counterparts. As such, they usually perform
worse. However, leveraging the power of modern architectures via vectorization,
the slow-down can be kept to a small value.

- ![](test/figs/perf.svg)
-
- In the graph above, the time spent in the summation (renormalized per element)
- is plotted against the vector size (the units in the y-axis label should be
- “ns/elem”). What we see with the standard summation is that, once vectors start
- having significant sizes (say, more than 1000 elements), the implementation is
- memory-bound (as expected of a typical BLAS1 operation). This is why we see
- significant decreases in the performance when the vector can’t fit into the L2
- cache (around 30k elements, or 256kB on my machine) or the L3 cache (around 400k
- elements, or 3MB on my machine).
-
- The Ogita-Rump-Oishi algorithm, when implemented with a suitable unrolling level
- (ushift=2, i.e. 2²=4 unrolled iterations), is CPU-bound when vectors fit inside
- the cache. However, when vectors are too large to fit into the L3 cache, the
- implementation becomes memory-bound again (on my system), which means we get the
- same performance as the standard summation.
+ ![](test/figs/sum_performance.svg)
+ ![](test/figs/dot_performance.svg)
+
+ Benchmarks presented in the above graphs were obtained on an Intel® Xeon® Gold
+ 6128 CPU @ 3.40GHz. The time spent in the summation (renormalized per element)
+ is plotted against the vector size. What we see with the standard summation is
+ that, once vectors start having significant sizes (say, more than a few
+ thousand elements), the implementation is memory-bound (as expected of a
+ typical BLAS1 operation). This is why we see significant decreases in the
+ performance when the vector can’t fit into the L1, L2 or L3 cache.
+
+ On this AVX512-enabled system, the Kahan-Babuska-Neumaier implementation tends
+ to be a little more efficient than the Ogita-Rump-Oishi algorithm (it would
+ generally be the opposite on AVX2 systems). When implemented with a suitable
+ unrolling level and cache prefetching, these implementations are CPU-bound when
+ vectors fit inside the L1 or L2 cache. However, when vectors are too large to
+ fit into the L2 cache, the implementation becomes memory-bound again (on this
+ system), which means we get the same performance as the standard
+ summation. Again, the same can be said for dot product calculations
+ (graph on the right), where the implementations from `AccurateArithmetic.jl`
+ compete against MKL's dot product.

In other words, the improved accuracy is free for sufficiently large
- vectors. For smaller vectors, the accuracy comes with a slow-down that can reach
- values slightly above 3 for vectors which fit in the L2 cache.
+ vectors. For smaller vectors, the accuracy comes with a slow-down by a factor of
+ approximately 3 in the L2 cache.


### Tests

The graphs above can be reproduced using the `test/perftests.jl` script in this
- repository. Before running them, be aware that it takes around one hour to
+ repository. Before running them, be aware that it takes around two hours to
generate the performance graph, during which the benchmark machine should be as
lightly loaded as possible in order to avoid perturbing performance measurements.
