Introduction
A couple of years ago, while working on LLD, we noticed that a substantial part of the link time was spent applying relocations. That was expected, since applying relocations is the main task of a linker. A relocation in a .o file tells the linker to compute an expression and write the result in the final binary. The expression depends on what the relocation is. For example, given

int foo = 42;
int* bar = &foo;

the relocation used to initialize bar asks for the final address of foo. There are many other types of relocations. For example, given

const char* a = "foo";

the relocation refers to the position of "foo" in the string table section in the .o file. The linker has to map that to the position of "foo" in the combined string table section in the output binary. This type of relocation is common when
producing binaries with debug info. To avoid looking up the same string
multiple times, the linker does it only once and keeps a direct
mapping from input position to output position.
A natural way to implement that map is a sorted array with binary search. At the time we found that this made the binary search one of the hottest pieces of code in the entire link. We solved the problem by using a hash table, but that increased the memory usage of the linker.
I got curious if there was something better than binary search and
found the awesome
paper Array Layouts for
Comparison-Based Searching. The paper covers various layouts and
all the code needed to reproduce the results
is available.
For our case, the short answer is that binary search is the best
option, but the implementation has to avoid unpredictable
branches. The main technique for that is to use conditional moves in
the body of the loop, since it is unpredictable whether the element we
are looking for is in the first or second half of the array.
I tried that in LLD. By compiling LLD with GCC I found that the binary search can indeed be fast when using conditional moves. Unfortunately, I could not convince LLVM to produce conditional moves instead of branches. I reported an LLVM bug, but it was never fixed.
The following graph compares binary search with different compilers
for arrays that fit in the L3 cache. It was produced with the scripts
used in the paper. The benchmark was run on my laptop running OpenBSD-current with an i7-7500U CPU.
In these CPUs a conditional move costs about the same as a regular add instruction. The number of cycles lost in case of a branch misprediction is called the misprediction penalty.
Let's look at why the branch-free version is faster at a particular data point, the largest one that fits in L1. That array has about \(2^{13}\) elements, so each search loop will iterate 13 times. The time difference between the two benchmark runs is about 0.12 seconds. Given that the CPU runs at 2.8 GHz and that each run does 2 million searches, the difference in cycles per loop iteration is
$$
0.12 * 2.8 * 10^{9} / ( 2*10^{6} * 13) \approx 13
$$
According to Agner, the "*Lake" CPUs have misprediction penalties of 15 to 20 cycles. We expect about half of the branches in the body of a search to be mispredicted, so it seems the misprediction can account for up to 10 of the 13 lost cycles. My guess is that the rest is just because the branch-free code is more compact.
This is interesting, because it shows the need for hiding the latency of computing the branch condition even when all the data is in L1. This probably explains why amd64, powerpc64 and aarch64 have a conditional move instruction. The one exception among modern architectures is RISC-V. I guess RISC-V implementations are supposed to use macro-op fusion instead, but I don't know if any does.
Eytzinger
Reading the paper also got me curious about the Eytzinger layout. While it is common to represent a tree implicitly in an array for priority queues, I hadn't seen it used for searching before. If this is also the first time you see it, take a look at the paper and come back; it has a very readable and concise description. The code from the paper is fairly simple:

template<typename T, typename I>
I eytzinger_array<T,I>::branchfree_search(T x) const {
  I i = 0;
  while (i < n) {
    i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  }
  I j = (i+1) >> __builtin_ffs(~(i+1));
  return (j == 0) ? n : j-1;
}
Note that this while loop, unlike the one in a binary search, iterates a different number of times depending on the value we are looking for. This creates a bumpiness when benchmarking repeated searches on the same array.
At the time I thought of a trick to make the code iterate a fixed
number of times, but never had the time to test it properly. I finally
found time and decided to take a second look, hence this post.
The first thing I did was try to reproduce the graphs. I hit a few issues running the scripts with Python 3 and have opened a pull request fixing them. I hope to contribute a bit more once/if that is merged. When I got the graphs, I was surprised to see that the branchy and branch-free Eytzinger variants were way too similar. That turns out to be because of a regression in GCC since the paper was published. Like LLVM with the binary search, current GCC produces branches in the Eytzinger variant that should be branch-free. At least ICC still produces conditional moves.
Fixing the bumpiness with Eytzinger
With the compiler issues out of the way, let's see if we can avoid the branch misprediction. The Eytzinger layout represents a tree where every level but the last is complete. The branch misprediction comes from the last level: sometimes the algorithm has to look at it, sometimes it doesn't. The gist of the improvement is to use the loop only for the complete levels and then handle the last level separately. Since a complete tree with \(n\) levels has \(2^{n}-1\) elements, to stop at the last complete level we just have to replace

while (i < n) {

with

I p2_floor = I(1) << (31 - __builtin_clz(n));
while (i < p2_floor) {

As for the final step over the last level, written naively with an if it becomes:
if (i < n) {
  i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
}
That if would be unpredictable. Most architectures don't have a conditional load, so we cannot directly convert the if into a conditional move. What we can do is first conditionally backtrack so that the load is always valid. A node at index i has a parent at index (i - 1)/2, and that we can compute with a conditional move:

i = i < n ? i : (i - 1) / 2;

After this we know that i < n and can remove the troublesome if. The entire function becomes:
template<typename T, typename I, bool aligned>
I eytzinger_array<T,I,aligned>::predictable_branchfree_search(T x) const {
  // The last power of 2 less than or equal to n
  I p2_floor = I(1) << (31 - __builtin_clz(n));
  I i = 0;
  while (i < p2_floor) {
    i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  }
  // Go back one step if we already passed n
  i = i < n ? i : (i - 1) / 2;
  // Handle the last level of the tree
  i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
  I j = (i+1) >> __builtin_ffs(~(i+1));
  return (j == 0) ? n : j-1;
}
Conclusion
Unlike the original paper, I have focused on arrays that fit in L3, since that is what I was trying to optimize when working on LLD. For this case, prefetching is not so relevant, so Eytzinger loses its main advantage and binary search is a bit faster. The bigger advantage binary search has right now is that current GCC will produce a branch-free binary search, but not a branch-free Eytzinger. Few projects can provide assembly implementations for various architectures or assume that they will be compiled with ICC. Given that, if your data fits in L3, a branch-free binary search is probably the best option. It also makes sense to try to compact the array as much as possible. For example, if it is an array of structs, move fields that are not used often to another array.

For compilers, there are two different ways to look at the situation. I think the more common view is that a branch and a conditional move are semantically identical. With this view, the current situation is, at most, a missed optimization. I say "at most" because it is not clear what the compiler could do. Just blindly switching to conditional moves would probably regress other cases. One could add a __builtin_unpredictable to complement __builtin_expect, or require profile info, but not much more.
Another way to look at this is to say that branches and conditional
moves are not equivalent: branches have a misprediction penalty,
conditional moves don't, and it is up to the programmer to
decide. With this view what is missing is
__builtin_branchless_select
, as suggested in a
comment in the GCC bug. A more radical idea would be to always follow the programmer: if a ternary operator is used, produce a conditional move; if not, produce a branch. I wonder what impact that would have on real world code.