# Pipelining and prefetching: a 45% speedup story

In this blog post I want to show you an impressive performance optimization
trick that I have recently come up with. Well, at least *I* was impressed;
your mileage may vary.

The trick consists in improving the memory
access pattern of the code and doing some manual
[*prefetching*](https://en.wikipedia.org/wiki/Prefetching).
I am not an expert on this topic, so I won't be able to explain
in detail how things work under the hood, but I'll try to provide
links to other sources whenever possible. For a start, you can have
a look at the amazing paper [*What Every Programmer Should Know About
Memory*](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf).
It is a bit dated, but still very relevant.

## tl;dr

The main computational step in my program is a [depth-first
search](https://en.wikipedia.org/wiki/Depth-first_search) on an implicit
graph. At each node, a heuristic is used to determine whether it is worth
continuing the search from the current node or not. The heuristic uses
data that is stored in a very large (many gigabytes), precomputed *pruning
table*.

Fetching data from RAM in random order is much slower than
other operations. However, if the CPU knows in advance
that some data should be fetched, it can *prefetch*
it while other operations are performed. There are even
[intrinsics](https://en.wikipedia.org/wiki/Intrinsic_function#C_and_C++)
that one can use to tell the CPU that it should prefetch some data and
keep it in cache.

My trick boils down to expanding multiple neighbor nodes "in
parallel". This way, after the pruning table index for one neighbor is
computed, the corresponding value is not immediately accessed and used,
as other nodes are expanded first. This gives the CPU some time to
prefetch the data.

With this I obtained a speedup of 33% to 45%, which made my program, as
far as I know, the fastest publicly available Rubik's cube optimal solver.

## The problem

The program that I am working on is an *optimal Rubik's cube
solver*. That is, a program that can find the shortest possible
solution to any given Rubik's Cube configuration.

This project has been an amazing playground for developing
[new skills](../2025-04-05-learned-rewrite). It has also been a source
of inspiration for many posts in this blog, starting from [*The big
rewrite*](../2023-04-10-the-big-rewrite), where I talked about how I
decided to start this project as a total rewrite of an earlier tool
of mine.

In this post we are only going to be concerned with the main solving
algorithm, the hardcore computational engine of this program.

You can find the source code
on [my git pages](https://git.tronto.net/nissy-core/) and on
[GitHub](https://github.com/sebastianotronto/nissy-core).

## The solving algorithm

### Solution search

The idea behind the solving algorithm is a simple brute-force tree
search: for each position, we make every possible move and check if
the configuration thus obtained is solved. If it is not, we proceed
recursively.

This is an oversimplification of course, and at least a couple of
optimizations are needed to make this approach even remotely feasible.

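
To fix ideas, here is a minimal sketch of this naive search in C. The cube
type and the helper functions are placeholders of mine, not the actual nissy
API, and I added a depth cap so that the recursion terminates:

```
#include <stdbool.h>

#define NMOVES 18 /* number of face moves: U, U2, U', R, R2, ... */

/* Placeholder cube type and helpers, not the real nissy API. */
typedef struct { char state[24]; } cube_t;
bool is_solved(cube_t);
cube_t apply_move(cube_t, int);
void report_solution(void);

/* Naive brute force: try every move and recurse. */
void search(cube_t cube, int depth, int max_depth)
{
	if (is_solved(cube)) {
		report_solution();
		return;
	}
	if (depth == max_depth)
		return;

	for (int move = 0; move < NMOVES; move++)
		search(apply_move(cube, move), depth + 1, max_depth);
}
```
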
First of all, one may naturally think of
implementing this algorithm as a [breadth-first
search](https://en.wikipedia.org/wiki/Breadth-first_search),
because we are trying to find the shortest possible solution. But
a BFS needs to keep a list of all the nodes to be processed at the
next iteration, and we are never going to have enough memory
for that. So we use instead an [iterative-deepening depth-first
search](https://en.wikipedia.org/wiki/Iterative_deepening_depth-first_search):
a DFS with depth limited to a certain value N that gets increased at
each iteration. This way we repeat some work, because a search at depth
N+1 includes a full search at depth N, but since the number of nodes
increases exponentially with the depth this won't matter much.

### Pruning tables and IDA\*

The second optimization we need is some kind of heuristic that
enables us to skip some branches: if, given a cube configuration,
we are able to determine a lower bound for the number of moves it
takes to solve it, we can stop searching at nodes whose value is
too high for the depth we are searching at.

For example, suppose we are searching for a 10-move solution,
and after 3 moves we get a lower bound of 8. With this we
know that we won't be able to find a solution shorter than
3 + 8 = 11 moves, so we can stop the search at this node.

An algorithm of this kind is sometimes called [iterative-deepening A\*
algorithm](https://en.wikipedia.org/wiki/Iterative_deepening_A*), or IDA\*
for short.

To get a lower bound for each configuration, we employ a large
pruning table, which one can think of as a hash map: from the
cube configuration we compute an integer (the hash value, or
[coordinate](../../speedcubing/coordinates)) which we use as an index
into the table. The value at this index is going to be a lower bound for
our configuration, and for all other configurations that share the same
"hash value". Since the number of valid Rubik's Cube configurations is on
the order of 10<sup>19</sup>, we are going to have many "hash collisions".
But the larger the table is, the more precise the lower bound can be,
and the faster the solver is.

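
Putting the two ideas together, the search ends up looking roughly like
this. Again, this is only a condensed C sketch with placeholder names of my
own (`coordinate()`, `pruning_table`, and so on); the real code is structured
differently and does a lot more:

```
#include <stdbool.h>
#include <stdint.h>

#define NMOVES 18

typedef struct { char state[24]; } cube_t; /* placeholder type */
bool is_solved(cube_t);
cube_t apply_move(cube_t, int);
void report_solution(void);
uint64_t coordinate(cube_t);    /* the "hash value" of a configuration */
extern uint8_t pruning_table[]; /* precomputed lower bounds */

/* Depth-limited DFS: "moves" moves have been done so far, and we are
 * looking for a solution of at most "target" moves in total. */
bool dfs(cube_t cube, int moves, int target)
{
	if (is_solved(cube)) {
		report_solution();
		return true;
	}
	if (moves == target)
		return false;

	/* Prune: the lower bound says we cannot finish within "target" moves. */
	if (moves + pruning_table[coordinate(cube)] > target)
		return false;

	bool found = false;
	for (int move = 0; move < NMOVES; move++)
		found |= dfs(apply_move(cube, move), moves + 1, target);

	return found;
}

/* IDA*: repeat the depth-limited search with an increasing depth limit. */
void solve(cube_t cube)
{
	int target = 0;
	while (!dfs(cube, 0, target))
		target++;
}
```
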
### The bottleneck

Many other optimizations are possible, and in fact my solver uses a
bunch of them, but the details are not relevant for this post. What is
relevant for us is that, for any reasonably optimized solver, looking up
the lower bound values in the pruning table is going to be the bottleneck.
Indeed, fetching data from memory is very slow on modern CPUs compared
to other operations - on the order of hundreds of times slower than
executing simple instructions, such as adding two numbers.

Luckily, there are some caveats. First of all, modern CPUs have two or
three levels of [cache](https://en.wikipedia.org/wiki/CPU_cache), where
recently-fetched data is stored for fast subsequent access. Unfortunately,
this won't save us, because we are accessing essentially random addresses
in our pruning table.

But there is a window for optimization. CPUs work in
[pipelines](https://en.wikipedia.org/wiki/Pipeline_(computing)), meaning
that, in some cases, they can execute different types of instructions
in parallel. For example, if the CPU can predict in advance that
the program is going to access some data in memory, it can start the
(slow) fetching routine while other instructions are still running.
In other words, the CPU can take advantage of a predictable [memory
access pattern](https://en.wikipedia.org/wiki/Memory_access_pattern).
And in some cases we may even be able to give the CPU a hint on when
to start fetching data, using specific prefetching instructions.

This is what my trick relies on.

## A software pipeline for the hardware's pipeline

### The old version

Before this optimization, my code followed a pattern that can be
summarized by the following Python-like pseudocode:

```
def visit(node):
    if should_stop(node):
        return

    if solution_is_reached(node):
        append_solution(node)
        return

    for neighbor in node.neighbors:
        visit(neighbor)

def should_stop(node):
    if should_stop_quick_heuristic(node):
        return True

    index = pruning_table_index(node)
    return pruning_table[index] > target
```

The actual code is a fair bit more complicated than the pseudocode above,
but this should give you a good idea of my earlier implementation.

As we discussed in the previous sections, the slow part of this code
is in the last two lines of `should_stop()`: we are trying to access
`pruning_table[index]` right after `index` was calculated. In my case, the
function `pruning_table_index()` does some non-trivial operations, so
it is impossible for the CPU to figure out, or even to guess, the value
of `index`. Therefore, there is no time for the CPU pipeline to fetch
the value of `pruning_table[index]` in advance: the code execution will
stall, waiting for the value to be retrieved from memory before it can
determine whether it is greater than `target` or not.

### The new version

To make this code more CPU-friendly, we need to change the order in
which we perform our operations. And we can do this by creating a sort
of pipeline of our own, expanding all neighbor nodes at the same time in
"stages". Something like this:

```
def visit(node):
    if solution_is_reached(node):
        append_solution(node)
        return

    prepared_neighbors = prepare_nodes(node.neighbors)
    for pn in prepared_neighbors:
        if not pn.stop:
            visit(pn.node)

def prepare_nodes(nodes):
    prepared_nodes = []

    # Stage 1: set up the nodes
    for node in nodes:
        pn = prepared_node()
        pn.node = node
        pn.stop = should_stop_quick_heuristic(node)
        prepared_nodes.append(pn)

    # Stage 2: compute the index and prefetch
    for pn in prepared_nodes:
        if not pn.stop:
            pn.index = pruning_table_index(pn.node)
            prefetch(pruning_table, pn.index)

    # Stage 3: access the pruning table value
    for pn in prepared_nodes:
        if not pn.stop:
            pn.stop = pruning_table[pn.index] > target

    return prepared_nodes
```

With this new version, we compute `index` for all neighbors well before
trying to access `pruning_table[index]` for each of them. Each node has
from 8 to 15 neighbors, and computing all these indices gives the CPU
time to fetch, or at least start to fetch, the corresponding table values.

It is interesting to note that, at least in theory, this new version
of the DFS may look slower than the previous one, because
we are potentially doing more work.
Imagine the solution is found while
visiting the node that came first in the list of neighbors of the previous
node: with the new algorithm, we have "prepared" for visiting all the other
neighbors too, including the expensive table lookups; but that turned
out to be wasted work!

But in reality the extra work is not much: at most 14 \* depth nodes
will be uselessly prepared at each iteration of the IDA\* algorithm.
In practice, the time saved by allowing the CPU to prefetch the data is
much more significant.

### Manual prefetch

In the pseudocode above I added a call to a `prefetch()` function.
What this exactly does depends on the language, the compiler and the
CPU architecture.

Most modern x86\_64 processors support instructions such as
`PREFETCHT0`, which can be triggered with the `_mm_prefetch()`
[intrinsic](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
This instruction tells the CPU to fetch the data stored at a specified
address in memory and keep it in cache. There are similar instructions
for other CPU architectures, but in practice it is more convenient to
rely on the `__builtin_prefetch()` function, supported by GCC, Clang,
and possibly other compilers.

In my case, re-organizing the code alone, without using any explicit
prefetch instruction, already gave a substantial 25% to 30% speedup on
my x86 machine. Adding an explicit `_mm_prefetch()` call resulted in an
additional 10% to 20% improvement. See the benchmarks below for details.

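
To make this concrete, here is roughly what the `prefetch()` call from the
pseudocode above could look like in C. This is only a sketch under the
assumption of GCC or Clang; the table declaration is a placeholder of mine:

```
#include <stdint.h>

#ifdef __SSE__
#include <xmmintrin.h> /* for _mm_prefetch() */
#endif

extern const uint8_t pruning_table[]; /* placeholder declaration */

/* Hint to the CPU that pruning_table[index] will be read soon, so it
 * can start moving the corresponding cache line into the cache now. */
static inline void prefetch_table_entry(uint64_t index)
{
#ifdef __SSE__
	/* x86-specific intrinsic, emits a PREFETCHT0 instruction. */
	_mm_prefetch((const char *)&pruning_table[index], _MM_HINT_T0);
#else
	/* Portable GCC/Clang builtin: 0 = read access, 3 = high temporal
	 * locality (try to keep the data in all cache levels). */
	__builtin_prefetch(&pruning_table[index], 0, 3);
#endif
}
```

In the staged pseudocode above, this hint is issued in stage 2, right after
the index is computed; by the time stage 3 actually reads
`pruning_table[pn.index]`, the data has hopefully already arrived in cache.
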
## Measuring the performance gain

In the repository you can find a comprehensive set of benchmarks
of the current version of the H48 solver. They include a comparison
with [vcube](https://github.com/voltara/vcube) and an analysis of
multi-threading performance. In this section, however, we are going to
focus on the performance gain obtained by pipelining the IDA\* search.

The H48 solver supports pruning tables of different sizes; I have included
6 of them in these benchmarks, ranging from 1.8 GiB to 56.5 GiB. *(Yes,
I have 64GB of memory on [my desktop](../2023-10-15-build-time);
I bought it before RAM prices went [through the
roof](https://en.wikipedia.org/wiki/2024%E2%80%932026_global_memory_supply_shortage).)*

Since the runtime increases exponentially with each additional move,
I like to split the benchmarks by length of the optimal solution. Here
I used sets of 25 cube positions for each solution length between 16 and
19. I have then used the [approximate distribution for a random Rubik's
Cube configuration](https://cube20.org) to calculate the average runtime.
In short, about 99.8% of all configurations require between 16 and
19 moves to solve optimally, and are thus covered by these tests.

### Old version (no pipeline, no prefetch)

| Solver  | Size    |16 moves|17 moves|18 moves|19 moves|
|:--------|:--------|-------:|-------:|-------:|-------:|
|H48 h11  |56.5 GiB |    0.05|    0.13|    0.73|    3.43|
|H48 h10  |28.3 GiB |    0.06|    0.23|    1.20|    6.18|
|H48 h9   |14.1 GiB |    0.08|    0.36|    2.47|   11.79|
|H48 h8   | 7.1 GiB |    0.12|    0.79|    6.09|   25.58|
|H48 h7   | 3.5 GiB |    0.11|    1.07|    8.59|   42.16|
|H48 h6   | 1.8 GiB |    0.19|    2.12|   16.29|   82.24|

*Time per cube (in seconds, lower is better).*

### Pipelined version without manual prefetch

| Solver  | Size    |16 moves|17 moves|18 moves|19 moves|
|:--------|:--------|-------:|-------:|-------:|-------:|
|H48 h11  |56.5 GiB |    0.04|    0.10|    0.55|    2.41|
|H48 h10  |28.3 GiB |    0.05|    0.16|    0.89|    4.33|
|H48 h9   |14.1 GiB |    0.06|    0.29|    1.82|    8.26|
|H48 h8   | 7.1 GiB |    0.11|    0.59|    4.37|   17.97|
|H48 h7   | 3.5 GiB |    0.10|    0.79|    6.01|   28.98|
|H48 h6   | 1.8 GiB |    0.18|    1.56|   11.72|   57.92|

*Time per cube (in seconds, lower is better).*

### Pipelined version with manual prefetch

| Solver  | Size    |16 moves|17 moves|18 moves|19 moves|
|:--------|:--------|-------:|-------:|-------:|-------:|
|H48 h11  |56.5 GiB |    0.04|    0.09|    0.50|    2.24|
|H48 h10  |28.3 GiB |    0.04|    0.15|    0.76|    3.36|
|H48 h9   |14.1 GiB |    0.05|    0.24|    1.48|    6.69|
|H48 h8   | 7.1 GiB |    0.09|    0.46|    3.36|   14.13|
|H48 h7   | 3.5 GiB |    0.09|    0.63|    4.85|   23.25|
|H48 h6   | 1.8 GiB |    0.14|    1.25|    9.45|   46.31|

*Time per cube (in seconds, lower is better).*

### Result summary

Here are the derived results for random cube positions:

| Solver  | Size    |Old version|Pipelined|Pipelined + prefetch|
|:--------|:--------|----------:|--------:|-------------------:|
|H48 h11  |56.5 GiB |       0.64|     0.48|                0.43|
|H48 h10  |28.3 GiB |       1.07|     0.78|                0.66|
|H48 h9   |14.1 GiB |       2.14|     1.57|                1.28|
|H48 h8   | 7.1 GiB |       5.14|     3.68|                2.84|
|H48 h7   | 3.5 GiB |       7.44|     5.20|                4.10|
|H48 h6   | 1.8 GiB |      14.22|    10.20|                8.21|

*Time per cube (in seconds, lower is better).*

These results imply the following speedups when compared to the old,
non-pipelined version:

| Solver  | Size    |Pipelined|Pipelined + prefetch|
|:--------|:--------|--------:|-------------------:|
|H48 h11  |56.5 GiB |      25%|                 33%|
|H48 h10  |28.3 GiB |      27%|                 38%|
|H48 h9   |14.1 GiB |      27%|                 40%|
|H48 h8   | 7.1 GiB |      28%|                 45%|
|H48 h7   | 3.5 GiB |      30%|                 45%|
|H48 h6   | 1.8 GiB |      28%|                 42%|

*Percentage speedup (higher is better).*

As you can see, the improvements are smaller for the larger
solvers. This makes sense, because larger tables provide a more precise
heuristic, leading to fewer table lookups. Looking at the complete tables
above, we can also see that the speedups are larger when the solutions are
longer, once again because longer solutions require more table lookups.

As we have previously noted, most of the benefit (25% to 30%) comes from
pipelining the algorithm, while adding a manual prefetch instruction
gives another 10% to 20% on top of that. This means that, once the
code is organized nicely from a memory access point of view, the CPU
can figure out what data to prefetch decently well. However, manual
prefetch still gives a significant improvement over the CPU's predictions.

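
A side note on how the derived averages were computed: they are just the
per-length times weighted by the approximate probability that a random
position requires that many moves. Using my rough reading of the
[cube20.org](https://cube20.org) estimates (about 2.6% of random positions
at 16 moves, 26.7% at 17, 67% at 18 and 3.4% at 19), the old-version
average for H48 h11, for example, comes out as:

```
0.026 * 0.05 + 0.267 * 0.13 + 0.670 * 0.73 + 0.034 * 3.43 ≈ 0.64 seconds
```
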
## Conclusion

I am very proud of this optimization: it required some work, but in the
end it paid off quite well. I am also happy that I was able to learn
and put into practice concepts that were relatively new to me.

I guess this is a good example of how low-level optimization actually
requires reorganizing how an algorithm works at a higher level of
abstraction. Memory access is so ubiquitous that we rarely think about
it, but it can often be the bottleneck; when it is, you may not be able
to optimize by just working locally.

The technique I used here could be pushed further, by extending the
pruning pipeline to the neighbors of the next node to be expanded.
But this is quite tricky, as it requires breaking up the recursive
structure of the DFS. It goes in the direction of implementing
[micro-threading](https://en.wikipedia.org/wiki/Micro-thread_(multi-core)),
a technique that was first pointed out to me by Tomas
Rokicki. If I end up implementing this too, I will definitely write a
dedicated post about it :)