Introduction
A couple of years ago, while working on LLD, we noticed that a substantial part of the link time was spent applying relocations. That was expected, since applying relocations is the main task of a linker. A relocation in a .o file tells the linker to compute an expression and write the result in the final binary. The expression depends on what the relocation is. For example, given

int foo = 42;
int* bar = &foo;

the relocation used to initialize bar asks for the final address of foo. There are many other types of relocations. For example, given

const char* a = "foo";

the relocation refers to the position of "foo" in the string table section in the .o file. The linker has to map that to the position of "foo" in the combined string table section in the output binary. This type of relocation is common when
producing binaries with debug info. To avoid looking up the same string
multiple times, the linker does it only once and keeps a direct
mapping from input position to output position.
A natural way to implement that map is a sorted array with binary search. At the time we found that this made the binary search one of the hottest pieces of code in the entire link. We solved the problem by using a hash table, but that increased the memory usage of the linker.
I got curious if there was something better than binary search and
found the awesome
paper Array Layouts for
Comparison-Based Searching. The paper covers various layouts and
all the code needed to reproduce the results
is available.
For our case, the short answer is that binary search is the best
option, but the implementation has to avoid unpredictable
branches. The main technique for that is to use conditional moves in
the body of the loop, since it is unpredictable whether the element we
are looking for is in the first or second half of the array.
I tried that in LLD. By compiling LLD with GCC I found that the binary search can indeed be fast when using conditional moves. Unfortunately, I could not convince LLVM to produce conditional moves instead of branches. I reported an LLVM bug, but it was never fixed.
The following graph compares binary search with different compilers
for arrays that fit in the L3 cache. It was produced with the scripts
used in the paper. The benchmark was run on my laptop running OpenBSD-current with an i7-7500U CPU.
In these CPUs a conditional move costs about the same as a regular add instruction. The number of cycles lost in case of a branch misprediction is called the misprediction penalty.
Let's look at why the branch-free version is faster at a particular data point, the largest one that fits in L1. That array has about \(2^{13}\) elements, so each search loop will iterate 13 times. The time difference between the two benchmark runs is about 0.12 seconds. Given that the CPU runs at 2.8 GHz and that each run does 2 million searches, the difference in cycles per loop iteration is
$$
0.12 * 2.8 * 10^{9} / ( 2*10^{6} * 13) \approx 13
$$
According to Agner, the "*Lake" CPUs have misprediction penalties of 15 to 20 cycles. We expect about half of the branches in the body of a search to be mispredicted, so it seems the misprediction can account for up to 10 of the 13 lost cycles. My guess is that the rest is just because the branch-free code is more compact.
This is interesting, because it shows the need for hiding the latency of computing the branch condition even when all the data is in L1. This probably explains why amd64, powerpc64 and aarch64 have a conditional move instruction. The one exception among modern architectures is RISC-V. I guess RISC-V implementations are supposed to use macro-op fusion instead, but I don't know if any does.
Eytzinger
Reading the paper also got me curious about the Eytzinger layout. While it is common to represent a tree implicitly in an array for priority queues, I hadn't seen it used for searching before. If this is also the first time you see it, take a look at the paper and come back; it has a very readable and concise description. The code from the paper is fairly simple:

template<typename T, typename I>
I eytzinger_array<T,I>::branchfree_search(T x) const {
  I i = 0;
  while (i < n) {
    i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  }
  I j = (i+1) >> __builtin_ffs(~(i+1));
  return (j == 0) ? n : j-1;
}
Note that this while loop, unlike the one in a binary search, iterates a different number of times depending on the value we are looking for. This creates a bumpiness when benchmarking repeated searches on the same array.
At the time I thought of a trick to make the code iterate a fixed
number of times, but never had the time to test it properly. I finally
found time and decided to take a second look, hence this post.
The first thing I did was try to reproduce the graphs. I hit a few issues running the scripts with Python 3 and have opened a pull request fixing them. I hope to contribute a bit more once/if that is merged. When I got the graphs, I was surprised to see that the branchy and branch-free Eytzinger variants were way too similar. That turns out to be because of a regression in GCC since the paper was published. Like LLVM with the binary search, current GCC produces branches in the Eytzinger variant that should be branch-free. At least ICC still produces conditional moves.
Fixing the bumpiness with Eytzinger
With the compiler issues out of the way, let's see if we can avoid the branch misprediction. The Eytzinger layout represents a tree where every level but the last is complete. The branch misprediction comes from the last level: sometimes the algorithm has to look at it, sometimes it doesn't. The gist of the improvement is to use the loop only for the complete levels and then handle the last level separately. Since a complete tree with \(n\) levels has \(2^{n}-1\) elements, to stop at the last complete level we just have to replace

while (i < n) {

with

I p2_floor = I(1) << (31 - __builtin_clz(n));
while (i < p2_floor) {

As for the final step over the last level, written naively with an if it becomes:
if (i < n) {
  i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
}
That if would be unpredictable. Most architectures don't have a conditional load, so we cannot directly convert the if into a conditional move. What we can do is first conditionally backtrack so that the load is always valid. A node at index i has a parent at index (i - 1)/2, and that we can compute with a conditional move:

i = i < n ? i : (i - 1) / 2;

After this we know that i < n and can remove the troublesome if. The entire function becomes:
template<typename T, typename I, bool aligned>
I eytzinger_array<T,I,aligned>::predictable_branchfree_search(T x) const {
  // The last power of 2 less than or equal to n
  I p2_floor = I(1) << (31 - __builtin_clz(n));
  I i = 0;
  while (i < p2_floor) {
    i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  }
  // Go back one step if we already passed n
  i = i < n ? i : (i - 1) / 2;
  // Handle the last level of the tree
  i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  I j = (i+1) >> __builtin_ffs(~(i+1));
  return (j == 0) ? n : j-1;
}
Conclusion
Unlike the original paper, I have focused on arrays that fit in L3, since that is what I was trying to optimize when working on LLD. For this case, prefetching is not so relevant, so Eytzinger loses its main advantage and binary search is a bit faster. The bigger advantage binary search has right now is that current GCC will produce a branch-free binary search, but not a branch-free Eytzinger. Few projects can provide assembly implementations for various architectures or assume that they will be compiled with ICC. Given that, if your data fits in L3, a branch-free binary search is probably the best option. It also makes sense to try to compact the array as much as possible. For example, if it is an array of structs, move fields that are not used often to another array.

For compilers, there are two different ways to look at the situation. I think the more common view is that a branch and a conditional move are semantically identical. With this view, the current situation is, at most, a missed optimization. I say "at most" because it is not clear what the compiler could do. Just blindly switching to conditional moves would probably regress other cases. One could add a __builtin_unpredictable to complement __builtin_expect, or require profile info, but not much more.
Another way to look at this is to say that branches and conditional
moves are not equivalent: branches have a misprediction penalty,
conditional moves don't, and it is up to the programmer to
decide. With this view what is missing is
__builtin_branchless_select
, as suggested in a
comment in the GCC bug. A more radical idea would be to always follow the programmer: if a ternary operator is used, produce a conditional move; if not, produce a branch. I wonder what impact that would have on real world code.