## Introduction

A couple of years ago, while working on LLD, we noticed that a
substantial part of the link time was spent applying relocations. That
was expected, since applying relocations is the main task of a
linker. A relocation in a `.o` file tells the linker to compute an
expression and write the result in the final binary. The expression
depends on what the relocation is. For example, for

```
int foo = 42;
int* bar = &foo;
```

the compiler produces a relocation telling the linker to write the
final address of `foo` into the location of `bar`. There are many
other types of relocations. For example, in
`const char* a = "foo";`

the relocation refers to the position of `"foo"` in the string table
section of the `.o` file. The linker has to map that to the position
of `"foo"` in the combined string table section of the output
binary. This type of relocation is common when producing binaries with
debug info. To avoid looking up the same string multiple times, the
linker does it only once and keeps a direct mapping from input
position to output position.
A natural way to implement that map is a sorted array with binary search. At the time we found that this binary search was one of the hottest pieces of code in the entire link. We solved the problem by switching to a hash table, but that increased the memory usage of the linker.

I got curious if there was something better than binary search and found the awesome paper Array Layouts for Comparison-Based Searching. The paper covers various layouts and all the code needed to reproduce the results is available.

For our case, the short answer is that binary search is the best option, but the implementation has to avoid unpredictable branches. The main technique for that is to use conditional moves in the body of the loop, since it is unpredictable whether the element we are looking for is in the first or second half of the array.
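As a concrete illustration, here is a sketch of a branch-free binary search over a sorted `std::vector<int>` (the function name and interface are mine, not LLD's). The hope is that the compiler turns the ternary into a conditional move, which, as discussed here, is not guaranteed:

```cpp
#include <cstddef>
#include <vector>

// Branch-free binary search: the loop runs a number of times that
// depends only on n, and the comparison result selects between two
// pointers instead of deciding a taken/not-taken branch.
// Returns the index of the first element >= x (n if none).
std::size_t branchfree_lower_bound(const std::vector<int>& a, int x) {
  const int* base = a.data();
  std::size_t n = a.size();
  while (n > 1) {
    std::size_t half = n / 2;
    // Ideally compiled to a conditional move, not a branch.
    base = (base[half - 1] < x) ? base + half : base;
    n -= half;
  }
  std::size_t i = base - a.data();
  return i + (!a.empty() && a[i] < x);
}
```

When the compiler cooperates, the hot part of the loop is just a compare and a conditional select per iteration.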

I tried that in LLD. By compiling LLD with GCC I found that the binary search can indeed be fast if it uses conditional moves. Unfortunately, I could not convince LLVM to produce conditional moves instead of branches. I have reported an LLVM bug, but it was never fixed.

The following graph compares binary search with different compilers for arrays that fit in the L3 cache. It was produced with the scripts used in the paper. The benchmark was run on my laptop running OpenBSD-current with an i7-7500U cpu.

When a branch prediction is correct, execution continues normally and
the branch is no more expensive than a regular instruction. When it
fails, the cpu has to discard the instructions it incorrectly started
executing. A conditional move, in contrast, never causes this
problem. Its value will not be known for some cycles, but that is no
different from what happens with, for example, an `add`
instruction. The number of cycles lost on a branch misprediction is
called the misprediction penalty.
Let's look at why the branch-free version is faster at a particular data point, the largest one that fits in L1. That array has about \(2^{13}\) elements, so each search loop will iterate 13 times. The time difference between the two benchmark runs is about 0.12 seconds. Given that the cpu runs at 2.8 GHz and that each run does 2 million searches, the difference in cycles per loop iteration is $$ 0.12 \times 2.8 \times 10^{9} / (2 \times 10^{6} \times 13) \approx 13 $$

According to Agner Fog's tables, the "lake" cpus have misprediction penalties of 15 to 20 cycles. We expect about half of the branches in the body of a search to be mispredicted, so mispredictions can account for up to 10 of the 13 lost cycles. My guess is that the rest is just because the branch-free code is more compact.

This is interesting, because it shows the need for hiding the branch computation latency even when all the data is in L1. It probably explains why amd64, powerpc64 and aarch64 have a conditional move instruction. The one exception among modern architectures is risc-v. I guess risc-v implementations are supposed to use macro-op fusion, but I don't know if any does.

## Eytzinger

Reading the paper also got me curious about the Eytzinger layout. While it is common to represent a tree implicitly in an array for priority queues, I hadn't seen it used for searching before. If this is also the first time you see it, take a look at the paper and come back; it has a very readable and concise description. The code from the paper is fairly simple:

```
template<typename T, typename I>
I eytzinger_array<T,I>::branchfree_search(T x) const {
  I i = 0;
  while (i < n) {
    i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  }
  I j = (i+1) >> __builtin_ffs(~(i+1));
  return (j == 0) ? n : j-1;
}
```
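The function assumes the array is already in Eytzinger order. For reference, here is one way to build that order from a sorted array (this construction helper is my own sketch, not code from the paper): an in-order walk of the implicit tree, with the children of node `i` at `2i+1` and `2i+2`, visits the elements in sorted order, so we can fill the nodes during such a walk.

```cpp
#include <cstddef>
#include <vector>

// Fill `out` (same size as `sorted`) with the Eytzinger layout.
// `next` is the position of the next sorted element to place and
// must start at 0; `i` is the current node of the implicit tree.
void build_eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                     std::size_t& next, std::size_t i = 0) {
  if (i >= out.size())
    return;
  build_eytzinger(sorted, out, next, 2 * i + 1); // left subtree
  out[i] = sorted[next++];                       // this node
  build_eytzinger(sorted, out, next, 2 * i + 2); // right subtree
}
```

For the sorted array {1, 3, 5, 7, 9, 11, 13} this produces {7, 3, 11, 1, 5, 9, 13}: the root holds the median and each tree level is laid out contiguously.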

Note that the `while` loop, unlike the one in a binary search,
iterates a different number of times depending on the value we are
looking for. This creates a bumpiness when benchmarking repeated
searches on the same array.
At the time I thought of a trick to make the code iterate a fixed number of times, but never had the time to test it properly. I finally found time and decided to take a second look, hence this post.

The first thing I did was try to reproduce the graphs. I hit a few issues running the scripts with python 3 and have opened a pull request fixing them. I hope to contribute a bit more once/if that is merged. When I got the graphs, I was surprised to see that the branchy and branch-free Eytzinger variants were far too similar. That turns out to be because of a regression in GCC since the paper was published. Like LLVM with the binary search, current GCC produces branches for the Eytzinger variant that should be branch-free. At least ICC still produces conditional moves:

I have reported a GCC bug. Let's hope we have better luck with this one. In the following comparisons I have used the ICC output for the branch-free variants.

## Fixing the bumpiness with Eytzinger

With the compiler issues out of the way, let's see if we can avoid the branch misprediction. The Eytzinger layout represents a tree where every level but the last is complete. The branch misprediction comes from the last level: sometimes the algorithm has to look at it, sometimes it doesn't. The gist of the improvement is to use the loop only for the complete levels and then handle the last level separately. Since a complete tree with \(n\) levels has \(2^{n}-1\) elements, to stop at the last complete level we just have to replace `n` in

```
while (i < n) {
```

with the largest power of two less than or equal to `n`:

```
I p2_floor = I(1) << (31 - __builtin_clz(n));
while (i < p2_floor) {
```

As for the last level, if we convert its handling into an `if`, it becomes:

```
if (i < n) {
  i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
}
```

But that `if` would be unpredictable. Most architectures don't have a
conditional load, so we cannot directly convert the `if` into a
conditional move. What we can do is first conditionally backtrack so
that the load is always valid. A node at index `i` has a parent at
index `(i - 1)/2`, and that we can compute with a conditional move:

```
i = i < n ? i : (i - 1) / 2;
```

After the backtrack we know that `i < n` and can remove the
troublesome `if`. The entire function becomes:
```
template<typename T, typename I, bool aligned>
I eytzinger_array<T,I>::predictable_branchfree_search(T x) const {
  // The last power of 2 less than or equal to n
  I p2_floor = I(1) << (31 - __builtin_clz(n));
  I i = 0;
  while (i < p2_floor) {
    i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  }
  // Go back one step if we already passed n
  i = i < n ? i : (i - 1) / 2;
  // Handle the last level of the tree
  i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  I j = (i+1) >> __builtin_ffs(~(i+1));
  return (j == 0) ? n : j-1;
}
```
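As a sanity check, here is a standalone sketch of the same logic, with the array passed explicitly instead of living in a class (the free-function packaging is mine; the algorithm is unchanged). It can be compared directly against an ordinary search over the sorted source array:

```cpp
#include <cstddef>
#include <vector>

// Standalone version of predictable_branchfree_search: returns the
// index, in the Eytzinger-ordered array `a`, of the smallest element
// >= x, or n if there is none. Assumes n > 0 (clz is undefined on 0).
std::size_t predictable_search(const std::vector<int>& a, int x) {
  std::size_t n = a.size();
  // The last power of 2 less than or equal to n
  std::size_t p2_floor = std::size_t(1) << (31 - __builtin_clz((unsigned)n));
  std::size_t i = 0;
  while (i < p2_floor) // loop over the complete levels only
    i = (x <= a[i]) ? (2 * i + 1) : (2 * i + 2);
  i = i < n ? i : (i - 1) / 2;                 // backtrack if we passed n
  i = (x <= a[i]) ? (2 * i + 1) : (2 * i + 2); // handle the last level
  std::size_t j = (i + 1) >> __builtin_ffs(~(int)(i + 1));
  return (j == 0) ? n : j - 1;
}
```

For the Eytzinger array {7, 3, 11, 1, 5, 9} (built from {1, 3, 5, 7, 9, 11}), searching for 8 lands on index 5, which holds 9, the smallest element not less than 8.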

It is not clear what causes the remaining advantage of the binary search over Eytzinger. Is it just fewer instructions?

## Conclusion

Unlike the original paper, I have focused on arrays that fit in L3, since that is what I was trying to optimize when working on LLD. For this case prefetching is not so relevant, so Eytzinger loses its main advantage and binary search is a bit faster. The bigger advantage binary search has right now is that current GCC will produce a branch-free binary search, but not a branch-free Eytzinger search. Few projects can provide assembly implementations for various architectures or assume they will be compiled with ICC.

Given that, if your data fits in L3, a branch-free binary search is probably the best option. It also makes sense to try to compact the array as much as possible. For example, if it is an array of structs, move fields that are not used often to another array.
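The field-splitting idea can be sketched like this (the entry layout and names are hypothetical, loosely modeled on the relocation map described earlier, not actual LLD code):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical relocation-map entry: only `input_offset` is compared
// during the binary search, but `output_offset` and `flags` get
// pulled into cache alongside it.
struct Entry {
  uint64_t input_offset;  // hot: read by every comparison
  uint64_t output_offset; // cold: only read after the search
  uint32_t flags;         // cold
};

// Split layout: the searched keys are packed contiguously, so more
// of them fit in each cache line and in L1/L2 overall.
struct EntryMap {
  std::vector<uint64_t> input_offsets;  // hot, binary-searched
  std::vector<uint64_t> output_offsets; // cold, indexed by the result
  std::vector<uint32_t> flags;          // cold
};
```

The search touches only `input_offsets`; the cold vectors are read once, at the index the search returns.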

For compilers, there are two different ways to look at the
situation. I think the more common view is that a branch and a
conditional move are semantically identical. With this view, the
current situation is, at most, a missed optimization. I say "at most"
because it is not clear what the compiler could do. Just blindly
switching to conditional moves would probably regress other cases. One
could add a `__builtin_unpredictable` to complement
`__builtin_expect`, or require profile info, but not much more.

Another way to look at this is to say that branches and conditional
moves are not equivalent: branches have a misprediction penalty,
conditional moves don't, and it is up to the programmer to
decide. With this view what is missing is a
`__builtin_branchless_select`, as suggested in a comment in the GCC
bug. A more radical idea would be to always follow the programmer: if
a ternary operator is used, produce a conditional move; if not,
produce a branch. I wonder what impact that would have on real world
code.