# Pipelining and prefetching: a 45% speedup story

In this blog post I want to show you an impressive performance optimization
trick that I have recently come up with. Well, at least *I* was impressed;
your mileage may vary.

The trick consists in improving the memory
access pattern of the code and doing some manual
[*prefetching*](https://en.wikipedia.org/wiki/Prefetching).
I am not an expert on this topic, so I won't be able to explain
in detail how things work under the hood, but I'll try to provide
links to other sources whenever possible. For a start, you can have
a look at the amazing paper [*What Every Programmer Should Know About
Memory*](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf).
It is a bit dated, but still very relevant.

## tl;dr

The main computational step in my program is a [depth-first
search](https://en.wikipedia.org/wiki/Depth-first_search) on an implicit
graph. At each node, a heuristic is used to determine whether it is worth
continuing the search from the current node or not. The heuristic uses
data that is stored in a very large (many gigabytes), precomputed *pruning
table*.

Fetching data from RAM in random order is much slower than
other operations. However, if the CPU knows in advance
that some data should be fetched, it can *prefetch*
it while other operations are performed. There are even
[intrinsics](https://en.wikipedia.org/wiki/Intrinsic_function#C_and_C++)
that one can use to tell the CPU that it should prefetch some data and
keep it in cache.

My trick boils down to expanding multiple neighbor nodes "in
parallel". This way, after the pruning table index for one neighbor is
computed, the corresponding value is not immediately accessed and used,
as other nodes are expanded first. This gives the CPU some time to
prefetch the data.

With this I obtained a speedup of 33% to 45%, which made my program, as
far as I know, the fastest publicly available Rubik's cube optimal solver.

## The problem

The program that I am working on is an *optimal Rubik's cube
solver*. That is, a program that can find the shortest possible
solution to any given Rubik's Cube configuration.

This project has been an amazing playground for developing
[new skills](../2025-04-05-learned-rewrite). It has also been a source
of inspiration for many posts in this blog, starting from [*The big
rewrite*](../2023-04-10-the-big-rewrite), where I talked about how I
decided to start this project as a total rewrite of an earlier tool
of mine.

In this post we are only going to be concerned with the main solving
algorithm, the hardcore computational engine of this program.

You can find the source code
on [my git pages](https://git.tronto.net/nissy-core/) and on
[GitHub](https://github.com/sebastianotronto/nissy-core).

## The solving algorithm

### Solution search

The idea behind the solving algorithm is a simple brute-force tree
search: for each position, we make every possible move and check if
the configuration thus obtained is solved. If it is not, we proceed
recursively.

This is an oversimplification of course, and at least a couple of
optimizations are needed to make this approach even remotely feasible.

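
To fix ideas, here is a minimal sketch of this naive search in C. The cube
type and the helper functions are placeholders of mine, not the actual nissy
API, and I added a depth cap so that the recursion terminates:

```
#include <stdbool.h>

#define NMOVES 18 /* number of face moves: U, U2, U', R, R2, ... */

/* Placeholder cube type and helpers, not the real nissy API. */
typedef struct { char state[24]; } cube_t;
bool is_solved(cube_t);
cube_t apply_move(cube_t, int);
void report_solution(void);

/* Naive brute force: try every move and recurse. */
void search(cube_t cube, int depth, int max_depth)
{
	if (is_solved(cube)) {
		report_solution();
		return;
	}
	if (depth == max_depth)
		return;

	for (int move = 0; move < NMOVES; move++)
		search(apply_move(cube, move), depth + 1, max_depth);
}
```
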
First of all, one may naturally think of
implementing this algorithm as a [breadth-first
search](https://en.wikipedia.org/wiki/Breadth-first_search),
because we are trying to find the shortest possible solution. But
a BFS needs to keep a list of all the nodes to be processed at the
next iteration, and we are never going to have enough memory
for that. So we use instead an [iterative-deepening depth-first
search](https://en.wikipedia.org/wiki/Iterative_deepening_depth-first_search):
a DFS with depth limited to a certain value N that gets increased at
each iteration. This way we repeat some work, because a search at depth
N+1 includes a full search at depth N, but since the number of nodes
increases exponentially with the depth this won't matter much.

### Pruning tables and IDA\*

The second optimization we need is some kind of heuristic that
enables us to skip some branches: if, given a cube configuration,
we are able to determine a lower bound for the number of moves it
takes to solve it, we can stop searching at nodes whose value is
too high for the depth we are searching at.

For example, suppose we are searching for a 10-move solution,
and after 3 moves we get a lower bound of 8. With this we
know that we won't be able to find a solution shorter than
3 + 8 = 11 moves, so we can stop the search at this node.

An algorithm of this kind is sometimes called [iterative-deepening A\*
algorithm](https://en.wikipedia.org/wiki/Iterative_deepening_A*), or IDA\*
for short.

To get a lower bound for each configuration, we employ a large
pruning table, which one can think of as a hash map: from the
cube configuration we compute an integer (the hash value, or
[coordinate](../../speedcubing/coordinates)) which we use as an index
into the table. The value at this index is going to be a lower bound for
our configuration, and for all other configurations that share the same
"hash value". Since the number of valid Rubik's Cube configurations is on
the order of 10<sup>19</sup>, we are going to have many "hash collisions".
But the larger the table is, the more precise the lower bound can be,
and the faster the solver is.

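
Putting the two ideas together, the search ends up looking roughly like
this. Again, this is only a condensed C sketch with placeholder names of my
own (`coordinate()`, `pruning_table`, and so on); the real code is structured
differently and does a lot more:

```
#include <stdbool.h>
#include <stdint.h>

#define NMOVES 18

typedef struct { char state[24]; } cube_t; /* placeholder type */
bool is_solved(cube_t);
cube_t apply_move(cube_t, int);
void report_solution(void);
uint64_t coordinate(cube_t);    /* the "hash value" of a configuration */
extern uint8_t pruning_table[]; /* precomputed lower bounds */

/* Depth-limited DFS: "moves" moves have been done so far, and we are
 * looking for a solution of at most "target" moves in total. */
bool dfs(cube_t cube, int moves, int target)
{
	if (is_solved(cube)) {
		report_solution();
		return true;
	}
	if (moves == target)
		return false;

	/* Prune: the lower bound says we cannot finish within "target" moves. */
	if (moves + pruning_table[coordinate(cube)] > target)
		return false;

	bool found = false;
	for (int move = 0; move < NMOVES; move++)
		found |= dfs(apply_move(cube, move), moves + 1, target);

	return found;
}

/* IDA*: repeat the depth-limited search with an increasing depth limit. */
void solve(cube_t cube)
{
	int target = 0;
	while (!dfs(cube, 0, target))
		target++;
}
```
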
### The bottleneck

Many other optimizations are possible, and in fact my solver uses a
bunch of them, but the details are not relevant for this post. What is
relevant for us is that, for any reasonably optimized solver, looking up
the lower bound values in the pruning table is going to be the bottleneck.
Indeed, fetching data from memory is very slow on modern CPUs compared
to other operations - on the order of hundreds of times slower than
executing simple instructions, such as adding two numbers.

Luckily, there are some caveats. First of all, modern CPUs have two or
three levels of [cache](https://en.wikipedia.org/wiki/CPU_cache), where
recently-fetched data is stored for fast subsequent access. Unfortunately,
this won't save us, because we are accessing essentially random addresses
in our pruning table.

But there is a window for optimization. CPUs work in
[pipelines](https://en.wikipedia.org/wiki/Pipeline_(computing)), meaning
that, in some cases, they can execute different types of instructions
in parallel. For example, if the CPU can predict in advance that
the program is going to access some data in memory, it can start the
(slow) fetching routine while other instructions are still running.
In other words, the CPU can take advantage of a predictable [memory
access pattern](https://en.wikipedia.org/wiki/Memory_access_pattern).
And in some cases we may even be able to give the CPU a hint on when
to start fetching data, using specific prefetching instructions.

This is what my trick relies on.

## A software pipeline for the hardware's pipeline

### The old version

Before this optimization, my code followed a pattern that can be
summarized by the following Python-like pseudocode:

```
def visit(node):
    if should_stop(node):
        return

    if solution_is_reached(node):
        append_solution(node)
        return

    for neighbor in node.neighbors:
        visit(neighbor)

def should_stop(node):
    if should_stop_quick_heuristic(node):
        return True

    index = pruning_table_index(node)
    return pruning_table[index] > target
```

The actual code is a fair bit more complicated than the pseudocode above,
but this should give you a good idea of my earlier implementation.

As we discussed in the previous sections, the slow part of this code
is in the last two lines of `should_stop()`: we are trying to access
`pruning_table[index]` right after `index` was calculated. In my case, the
function `pruning_table_index()` does some non-trivial operations, so
it is impossible for the CPU to figure out, or even to guess, the value
of `index`. Therefore, there is no time for the CPU pipeline to fetch
the value of `pruning_table[index]` in advance: the code execution will
stall, waiting for the value to be retrieved from memory before it can
determine whether it is greater than `target` or not.

### The new version

To make this code more CPU-friendly, we need to change the order in
which we perform our operations. And we can do this by creating a sort
of pipeline of our own, expanding all neighbor nodes at the same time in
"stages". Something like this:

```
def visit(node):
    if solution_is_reached(node):
        append_solution(node)
        return

    prepared_neighbors = prepare_nodes(node.neighbors)
    for pn in prepared_neighbors:
        if not pn.stop:
            visit(pn.node)

def prepare_nodes(nodes):
    prepared_nodes = []

    # Stage 1: set up the nodes
    for node in nodes:
        pn = prepared_node()
        pn.node = node
        pn.stop = should_stop_quick_heuristic(node)
        prepared_nodes.append(pn)

    # Stage 2: compute the index and prefetch
    for pn in prepared_nodes:
        if not pn.stop:
            pn.index = pruning_table_index(pn.node)
            prefetch(pruning_table, pn.index)

    # Stage 3: access the pruning table value
    for pn in prepared_nodes:
        if not pn.stop:
            pn.stop = pruning_table[pn.index] > target

    return prepared_nodes
```

With this new version, we compute `index` for all neighbors well before
trying to access `pruning_table[index]` for each of them. Each node has
from 8 to 15 neighbors, and computing all these indices gives the CPU
time to fetch, or at least start to fetch, the corresponding table values.

It is interesting to note that, at least in theory, this new version
of the DFS may look slower than the previous one, because
we are potentially doing more work.
Imagine the solution is found while
visiting the node that came first in the list of neighbors of the previous
node: with the new algorithm, we have "prepared" for visiting all the other
neighbors too, including the expensive table lookups; but that turned
out to be wasted work!

But in reality the extra work is not much: at most 14 \* depth nodes
will be uselessly prepared at each iteration of the IDA\* algorithm.
In practice, the time saved by allowing the CPU to prefetch the data is
much more significant.

### Manual prefetch

In the pseudocode above I added a call to a `prefetch()` function.
What this exactly does depends on the language, the compiler and the
CPU architecture.

Most modern x86\_64 processors support instructions such as
`PREFETCHT0`, which can be triggered with the `_mm_prefetch()`
[intrinsic](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
This instruction tells the CPU to fetch the data stored at a specified
address in memory and keep it in cache. There are similar instructions
for other CPU architectures, but in practice it is more convenient to
rely on the `__builtin_prefetch()` function, supported by GCC, Clang,
and possibly other compilers.

In my case, re-organizing the code alone, without using any explicit
prefetch instruction, already gave a substantial 25% to 30% speedup on
my x86 machine. Adding an explicit `_mm_prefetch()` call resulted in an
additional 10% to 20% improvement. See the benchmarks below for details.

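
To make this concrete, here is roughly what the `prefetch()` call from the
pseudocode above could look like in C. This is only a sketch under the
assumption of GCC or Clang; the table declaration is a placeholder of mine:

```
#include <stdint.h>

#ifdef __SSE__
#include <xmmintrin.h> /* for _mm_prefetch() */
#endif

extern const uint8_t pruning_table[]; /* placeholder declaration */

/* Hint to the CPU that pruning_table[index] will be read soon, so it
 * can start moving the corresponding cache line into the cache now. */
static inline void prefetch_table_entry(uint64_t index)
{
#ifdef __SSE__
	/* x86-specific intrinsic, emits a PREFETCHT0 instruction. */
	_mm_prefetch((const char *)&pruning_table[index], _MM_HINT_T0);
#else
	/* Portable GCC/Clang builtin: 0 = read access, 3 = high temporal
	 * locality (try to keep the data in all cache levels). */
	__builtin_prefetch(&pruning_table[index], 0, 3);
#endif
}
```

In the staged pseudocode above, this hint is issued in stage 2, right after
the index is computed; by the time stage 3 actually reads
`pruning_table[pn.index]`, the data has hopefully already arrived in cache.
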
## Measuring the performance gain

In the repository you can find a comprehensive set of benchmarks
of the current version of the H48 solver. They include a comparison
with [vcube](https://github.com/voltara/vcube) and an analysis of
multi-threading performance. In this section, however, we are going to
focus on the performance gain obtained by pipelining the IDA\* search.

The H48 solver supports pruning tables of different sizes; I have included
6 of them in these benchmarks, ranging from 1.8 GiB to 56.5 GiB. *(Yes,
I have 64GB of memory on [my desktop](../2023-10-15-build-time);
I bought it before RAM prices went [through the
roof](https://en.wikipedia.org/wiki/2024%E2%80%932026_global_memory_supply_shortage).)*

Since the runtime increases exponentially with each additional move,
I like to split the benchmarks by length of the optimal solution. Here
I used sets of 25 cube positions for each solution length between 16 and
19. I have then used the [approximate distribution for a random Rubik's
Cube configuration](https://cube20.org) to calculate the average runtime.
In short, about 99.8% of all configurations require between 16 and
19 moves to solve optimally, and are thus covered by these tests.

### Old version (no pipeline, no prefetch)

| Solver  | Size    |16 moves|17 moves|18 moves|19 moves|
|:--------|:--------|-------:|-------:|-------:|-------:|
|H48 h11  |56.5 GiB |    0.05|    0.13|    0.73|    3.43|
|H48 h10  |28.3 GiB |    0.06|    0.23|    1.20|    6.18|
|H48 h9   |14.1 GiB |    0.08|    0.36|    2.47|   11.79|
|H48 h8   | 7.1 GiB |    0.12|    0.79|    6.09|   25.58|
|H48 h7   | 3.5 GiB |    0.11|    1.07|    8.59|   42.16|
|H48 h6   | 1.8 GiB |    0.19|    2.12|   16.29|   82.24|

*Time per cube (in seconds, lower is better).*

### Pipelined version without manual prefetch

| Solver  | Size    |16 moves|17 moves|18 moves|19 moves|
|:--------|:--------|-------:|-------:|-------:|-------:|
|H48 h11  |56.5 GiB |    0.04|    0.10|    0.55|    2.41|
|H48 h10  |28.3 GiB |    0.05|    0.16|    0.89|    4.33|
|H48 h9   |14.1 GiB |    0.06|    0.29|    1.82|    8.26|
|H48 h8   | 7.1 GiB |    0.11|    0.59|    4.37|   17.97|
|H48 h7   | 3.5 GiB |    0.10|    0.79|    6.01|   28.98|
|H48 h6   | 1.8 GiB |    0.18|    1.56|   11.72|   57.92|

*Time per cube (in seconds, lower is better).*

### Pipelined version with manual prefetch

| Solver  | Size    |16 moves|17 moves|18 moves|19 moves|
|:--------|:--------|-------:|-------:|-------:|-------:|
|H48 h11  |56.5 GiB |    0.04|    0.09|    0.50|    2.24|
|H48 h10  |28.3 GiB |    0.04|    0.15|    0.76|    3.36|
|H48 h9   |14.1 GiB |    0.05|    0.24|    1.48|    6.69|
|H48 h8   | 7.1 GiB |    0.09|    0.46|    3.36|   14.13|
|H48 h7   | 3.5 GiB |    0.09|    0.63|    4.85|   23.25|
|H48 h6   | 1.8 GiB |    0.14|    1.25|    9.45|   46.31|

*Time per cube (in seconds, lower is better).*

### Result summary

Here are the derived results for random cube positions:

| Solver  | Size    |Old version|Pipelined|Pipelined + prefetch|
|:--------|:--------|----------:|--------:|-------------------:|
|H48 h11  |56.5 GiB |       0.64|     0.48|                0.43|
|H48 h10  |28.3 GiB |       1.07|     0.78|                0.66|
|H48 h9   |14.1 GiB |       2.14|     1.57|                1.28|
|H48 h8   | 7.1 GiB |       5.14|     3.68|                2.84|
|H48 h7   | 3.5 GiB |       7.44|     5.20|                4.10|
|H48 h6   | 1.8 GiB |      14.22|    10.20|                8.21|

*Time per cube (in seconds, lower is better).*

These results imply the following speedups when compared to the old,
non-pipelined version:

| Solver  | Size    |Pipelined|Pipelined + prefetch|
|:--------|:--------|--------:|-------------------:|
|H48 h11  |56.5 GiB |      25%|                 33%|
|H48 h10  |28.3 GiB |      27%|                 38%|
|H48 h9   |14.1 GiB |      27%|                 40%|
|H48 h8   | 7.1 GiB |      28%|                 45%|
|H48 h7   | 3.5 GiB |      30%|                 45%|
|H48 h6   | 1.8 GiB |      28%|                 42%|

*Percentage speedup (higher is better).*

As you can see, the improvements are smaller for the larger
solvers. This makes sense, because larger tables provide a more precise
heuristic, leading to fewer table lookups. Looking at the complete tables
above, we can also see that the speedups are larger when the solutions are
longer, once again because longer solutions require more table lookups.

As we have previously noted, most of the benefit (25% to 30%) comes from
pipelining the algorithm, while adding a manual prefetch instruction
gives another 10% to 20% on top of that. This means that, once the
code is organized nicely from a memory access point of view, the CPU
can figure out what data to prefetch decently well. However, manual
prefetch still gives a significant improvement over the CPU's predictions.

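
A side note on how the derived averages were computed: they are just the
per-length times weighted by the approximate probability that a random
position requires that many moves. Using my rough reading of the
[cube20.org](https://cube20.org) estimates (about 2.6% of random positions
at 16 moves, 26.7% at 17, 67% at 18 and 3.4% at 19), the old-version
average for H48 h11, for example, comes out as:

```
0.026 * 0.05 + 0.267 * 0.13 + 0.670 * 0.73 + 0.034 * 3.43 ≈ 0.64 seconds
```
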
## Conclusion

I am very proud of this optimization: it required some work, but in the
end it paid off quite well. I am also happy that I was able to learn
and put into practice concepts that were relatively new to me.

I guess this is a good example of how low-level optimization actually
requires reorganizing how an algorithm works at a higher level of
abstraction. Memory access is so ubiquitous that we rarely think about
it, but it can often be the bottleneck; when it is, you may not be able
to optimize by just working locally.

The technique I used here could be pushed further, by extending the
pruning pipeline to the neighbors of the next node to be expanded.
But this is quite tricky, as it requires breaking up the recursive
structure of the DFS. It goes in the direction of implementing
[micro-threading](https://en.wikipedia.org/wiki/Micro-thread_(multi-core)),
a technique that was first pointed out to me by Tomas
Rokicki. If I end up implementing this too, I will definitely write a
dedicated post about it :)