nissy-core

The "engine" of nissy, including the H48 optimal solver.
git clone https://git.tronto.net/nissy-core

h48.md (23837B)


      1 # The H48 optimal solver
      2 
      3 This document contains information on the H48 Rubik's Cube optimal solver.
      4 This solver is occasionally improved and optimized, and this document
      5 is updated accordingly to reflect the current implementation.
      6 
      7 I highly encourage the reader to check out Jaap Scherphuis'
      8 [Computer Puzzling page](https://www.jaapsch.net/puzzles/compcube.htm)
      9 before reading this document.
     10 
Other background material that will be referenced throughout the text includes:
     12 
     13 * [Cube coordinates](https://sebastiano.tronto.net/speedcubing/coordinates)
     14   by Sebastiano Tronto
     15 * [Nxopt](https://github.com/rokicki/cube20src/blob/master/nxopt.md)
     16   by Tomas Rokicki
     17 * [The Mathematics behind Cube Explorer](https://kociemba.org/cube.htm)
  by Herbert Kociemba
     19 
Initially I intended to give minimal information here, citing the
above resources when needed, but I ended up rewriting even some of the
basic things in this document.
     23 
     24 ## Optimal solvers
     25 
     26 ### IDA*
     27 
     28 The basic idea behind the solver is an iterative-deepening
     29 depth-first search with pruning tables, also known as IDA*. You
     30 can find a detailed explanation of this idea in
     31 [Jaap's page](https://www.jaapsch.net/puzzles/compcube.htm#tree).
     32 
     33 To summarize, the IDA* algorithm finds the shortest solution(s) for a
     34 scrambled puzzle by using multiple depth-first searches at increasing
     35 depth. A breadth-first search is not applicable in this scenario because
     36 of the large amount of memory required.  At each depth, the puzzle's state
     37 is evaluated to obtain an estimated lower bound for how many moves are
     38 required to solve the puzzle; this is done by employing large pruning
     39 tables, as well as other techniques. If the lower bound exceeds the
     40 number of moves required to reach the current target depth, the branch is
     41 pruned. Otherwise, every possible move is tried, except those that would
     42 "cancel" with the preceding moves, obtaining new positions at depth N+1.
     43 
     44 ### Pruning tables and coordinates
     45 
A pruning table associates with each cube position a value that is a
lower bound for the number of moves it takes to solve that position.
To access this value, the cube must be turned into an index for the
given table. This is done using a
     50 [*coordinate*](https://sebastiano.tronto.net/speedcubing/coordinates),
     51 by which we mean a function from the set of (admissible, solvable) cube
     52 positions to the set of non-negative integers. The maximum value N of a
     53 coordinate must be less than the size of the pruning table; preferably
     54 there are no "gaps", that is the coordinate reaches all values between
     55 0 and N and the pruning table has size N+1.
     56 
     57 Coordinates are often derived from the cosets of a *target subgroup*.
     58 As an example, consider the pruning table that has one entry for each
     59 possible corner position, and the stored values denote the length of
     60 an optimal corner solution for that position. In this situation, the
     61 target subgroup is the set of all cube positions with solved corners
     62 (for the Mathematician: this is indeed a subgroup of the group of all
     63 cube positions). This group consists of `12! * 2^10` positions.  The set
     64 of cosets of this subgroup can be identified with the set of corner
     65 configurations, and it has size `8! * 3^7`.  By looking at corners,
     66 from each cube position we can compute a number from `0` to `8! * 3^7 - 1`
     67 and use it as an index for the pruning table.
     68 
     69 The pruning table can be filled in many ways. A common way is starting
     70 with the coordinate of the solved cube (often 0) and visiting the set
     71 of all coordinates with a breadth-first search. To do this, one needs
     72 to apply a cube move to a coordinate; this can either be done directly
(for example using transition tables, as explained in Jaap's page) or
     74 by converting the coordinate back into a valid cube position, applying
     75 the move and then computing the coordinate of the new position.
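A sketch of this breadth-first fill, on a toy coordinate where the moves act on the coordinate directly (playing the role of a transition table). Illustrative only, not the actual implementation:

```python
from collections import deque

N = 12
MOVES = (1, -1, 6)  # toy moves acting directly on the coordinate

def fill_ptable():
    """Start from the solved coordinate (0) and visit all coordinates
    breadth-first, recording the distance from solved."""
    UNSET = 255
    table = [UNSET] * N
    table[0] = 0
    queue = deque([0])
    while queue:
        c = queue.popleft()
        for m in MOVES:
            nc = (c + m) % N  # "transition table" step: move applied to coord
            if table[nc] == UNSET:
                table[nc] = table[c] + 1
                queue.append(nc)
    return table
```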
     76 
The larger a pruning table is, the more information it contains and
the faster the solver that uses it is. It is therefore convenient to
     79 compress the tables as much as possible. This can be achieved by using
     80 4 bits per position to store the lower bound, or as little as 2 bits
     81 per position with some caveats (see Jaap's page or Rokicki's nxopt
     82 description for two possible ways to achieve this).  Tables that use
     83 only 1 bit per position are possible, but the only information they
can provide is whether the given position requires more or less than a
     85 certain number of moves to be solved.
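For example, 2-bit entries can be packed four per byte along these lines (a generic sketch of the packing, not the actual memory layout of nissy-core):

```python
def set2(table: bytearray, i: int, v: int) -> None:
    """Store the 2-bit value v (0..3) at index i, four entries per byte."""
    shift = (i % 4) * 2
    table[i // 4] = (table[i // 4] & ~(3 << shift)) | ((v & 3) << shift)

def get2(table: bytearray, i: int) -> int:
    """Read the 2-bit value at index i."""
    return (table[i // 4] >> ((i % 4) * 2)) & 3
```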
     86 
     87 ### Symmetries
     88 
     89 Another, extremely effective way of reducing the size of a pruning table
     90 is using symmetries. In the example above of the corner table, one can
     91 notice that all corner positions that are one quarter-turn away from being
     92 solved are essentially the same: they can be turned one into the other
     93 by rotating the cube and optionally applying a mirroring transformation.
     94 Indeed, every position belongs to a set of up to 48 positions that are
     95 "essentially the same" in this way.  Keeping a separate entry in the
     96 pruning table for each of these positions is a waste of memory.
     97 
There are ways to reduce a coordinate by symmetry, grouping all
transformation-equivalent positions under the same index. You can
    100 find a description in my
    101 [cube coordinates page](https://sebastiano.tronto.net/speedcubing/coordinates).
    102 By doing so, the size of the pruning table is reduced without loss of
    103 information.
    104 
    105 Some well-known solvers (Cube Explorer, nxopt) do not take advantage
    106 of the full group of 48 symmetries of the cube, but they use only the
    107 16 symmetries that fix the UD axis. However, they make up for this by
    108 doing 3 pruning table lookups, one for each axis of the cube.
    109 
    110 Some cube positions are self-symmetric: when applying certain
    111 transformations to them, they remain the same. For example, the
    112 cube obtained by applying U2 to a solved cube is invariant with
    113 respect to the mirroring along the RL-plane. This fact has a couple
    114 of consequences:
    115 
    116 * Reducing by a group of N symmetries does not reduce the size of
    117   the pruning table by a factor of N, but by a slightly smaller
    118   factor. If the size of the coordinate is large, this fact is
    119   harmless.
    120 * When combining symmetry-reduced coordinates with other coordinates,
    121   one has to be extra careful with handling self-symmetric coordinates.
    122   An informal explanation of this phenomenon can be found in
    123   the [docs/nasty-symmetries.md document](./nasty-symmetries.md).
    124 
    125 ## The H48 solver
    126 
    127 ### The target subgroup: H48 coordinates
    128 
    129 The H48 solver uses a target group that is invariant under all 48
    130 symmetries. This group is defined as follows:
    131 
    132 * Each corner is placed in the *corner tetrad* it belongs to, and it
    133   is oriented. Here by corner tetrad we mean one of the two sets of
    134   corners {UFR, UBL, DFL, DBR} and {UFL, UBR, DFR, DBL}. For a corner
    135   that is placed in its own tetrad, the orientation does not depend on
    136   the reference axis chosen, and an oriented corner can be defined
  as having the U or D sticker facing U or D.
    138 * Each edge is placed in the *slice* it belongs to. The three edge slices
    139   are M = {UF, UB, DB, DF}, S = {UR, UL, DL, DR} and E = {FR, FL, BL, BR}.
    140 
    141 Other options are available for corners: for example, we could
    142 impose that the corner permutation is even or that corners are in
    143 *half-turn reduction* state, that is they are in the group generated
    144 by {U2, D2, R2, L2, F2, B2}. We settled on the corner group described
    145 above because it is easy to calculate and it gives enough options
    146 for pruning table sizes, depending on how edge orientation is
    147 considered (see the next section).
    148 
    149 We call the coordinate obtained from the target subgroup described
above the **h0** coordinate. This coordinate has `C(8,4) * 3^7 *
C(12,8) * C(8,4) = 5,304,568,500` possible values. If the corner
part of the coordinate is reduced by symmetry, it consists of only
3393 elements, giving a total of `3393 * C(12,8) * C(8,4) =
117,567,450` values. This is not very large for a pruning table,
    155 as with 2 bits per entry it would occupy less than 30MB. However,
    156 it is possible to take edge orientation (partially) into account
    157 to produce larger tables.
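These counts are easy to check, for instance in Python:

```python
from math import comb

# Sizes of the h0 coordinate, full and with symmetry-reduced corners.
h0_full = comb(8, 4) * 3**7 * comb(12, 8) * comb(8, 4)
h0_sym = 3393 * comb(12, 8) * comb(8, 4)  # corner part reduced by symmetry

assert h0_full == 5_304_568_500
assert h0_sym == 117_567_450
# At 2 bits per entry, four entries fit in a byte:
assert h0_sym // 4 < 30 * 10**6  # less than 30MB
```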
    158 
    159 ### Edge orientation
    160 
    161 Like for corners, the orientation of an edge is well-defined independently
    162 of any axis when said edge is placed in the slice it belongs to. We may
    163 modify the target subgroup defined in the previous section by imposing
    164 that the edges are all oriented. This yields a new coordinate, which we
call **h11**, whose full size with symmetry-reduced corners is `3393 *
C(12,8) * C(8,4) * 2^11 = 240,778,137,600`. A pruning table based on
    167 this coordinate with 2 bits per entry would occupy around 60GB, which
    168 is a little too much for most personal computers.
    169 
    170 One can wonder if it is possible to use a coordinate that considers
    171 the orientation of only *some* of the edges, which we may call **h1**
    172 to **h10**. Such coordinates do exist, but they are not invariant under
    173 the full symmetry group: indeed an edge whose orientation we keep track
    174 of could be moved to any of the untracked edges' positions by one of
    175 the symmetries, making the whole coordinate ill-defined.
    176 
    177 It is however possible to compute the symmetry-reduced pruning tables
    178 for these coordinates. One way to construct them is by taking the **h11**
pruning table and "forgetting" some of the edge orientation values,
collapsing 2 (or a higher power of 2) values into one by taking the minimum.
    181 It is also possible to compute these tables directly, as explained in
    182 the **Pruning table computation** section below.
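The "forgetting" step can be sketched as follows, on a made-up flat layout where the lowest index bits are the edge-orientation bits being dropped (the real tables are organized differently):

```python
def collapse_eo(h11_table, forgotten_bits=1):
    """Merge groups of 2**forgotten_bits entries that differ only in the
    forgotten edge-orientation bits, keeping the minimum: the minimum of
    valid lower bounds is still a valid lower bound."""
    k = 2 ** forgotten_bits
    return [min(h11_table[i * k : (i + 1) * k])
            for i in range(len(h11_table) // k)]
```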
    183 
    184 ### Coordinate computation for pruning value estimation
    185 
    186 In order to access the pruning value for a given cube position, we first
    187 need to compute its coordinate value. Let's use for simplicity the **h0**
    188 coordinate, but everything will be valid for any other coordinate of
    189 the type described in the previous section.
    190 
    191 As described in the symmetric-composite coordinate section of
    192 [my coordinates page](https://sebastiano.tronto.net/speedcubing/coordinates),
    193 we first need to compute the part of the coordinate where the symmetry
    194 is applied, that is the corner separation + orientation (also called
    195 *cocsep*). The value of the coordinate depends on the equivalence
class of the corner configuration, and is stored in a table called
    197 `cocsepdata`. To avoid lengthy computations, the index for a cube position
    198 in this table is determined by the corner orientation (between `0` and
    199 `3^7-1`) and a binary representation of the corner separation part (an
8-bit number with 4 zero digits and 4 one digits, where zeros denote
corners in one tetrad and ones denote corners in the other; the last bit
can be ignored, as it can easily be deduced from the other 7). This is
slightly less space-efficient than computing the actual corner-separation
coordinate as a value between `0` and `C(8,4) - 1`, but it is much faster.
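In code, this index computation looks roughly as follows (a hypothetical sketch, not the actual nissy-core function):

```python
def cocsep_index(orientations, tetrads):
    """Hypothetical index computation.
    orientations: 8 corner orientations in {0, 1, 2}; only the first 7
    matter, since the 8th is determined by the others.
    tetrads: 8 bits, 0 or 1 depending on which tetrad each corner
    currently sits in. The result combines the orientation value
    (0..3^7-1) with the 8-bit separation pattern."""
    co = 0
    for o in orientations[:7]:
        co = 3 * co + o
    sep = 0
    for b in tetrads:
        sep = 2 * sep + b
    return (co << 8) | sep
```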
    205 
    206 We can now access the value in the `cocsepdata` table corresponding to
    207 our cube position. This table contains the value of the symmetry-reduced
    208 coordinate and the so-called *ttrep* (transformation to representative)
    209 necessary to compute the full **h0** coordinate. For convenience,
    210 this table also stores a preliminary pruning value based on the corner
    211 coordinate.  If this value is large enough, we may skip the computation of
    212 the **h0** coordinate and any further estimate. These three values can be
stored in a single 32-bit integer, so that the table uses less than 12MB.
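A possible packing of the three values into a single 32-bit word (the field widths here are plausible guesses, not necessarily those used by nissy-core):

```python
def pack(symcoord, ttrep, pvalue):
    """symcoord < 3393 fits in 12 bits, ttrep < 48 in 6 bits, and the
    preliminary pruning value in 4 bits: 22 bits in total, well within
    a 32-bit word. Field widths are assumptions for illustration."""
    return (symcoord << 10) | (ttrep << 4) | pvalue

def unpack(word):
    return word >> 10, (word >> 4) & 0x3F, word & 0xF
```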
    214 
    215 ### Getting the pruning value
    216 
    217 Once we have computed the full h0 coordinate, we can access
    218 the correct entry in the full pruning table. We chose to
    219 use 2 bits per entry, as described by Rokicki in the [nxopt
    220 document](https://github.com/rokicki/cube20src/blob/master/nxopt.md).
    221 This means that every pruning table needs to have an associated *base
    222 value*, that determines the offset to be added to each entry (each entry
    223 can only be 0, 1, 2 or 3).  If the base value is `b`, a pruning value
    224 of 1, 2 or 3 can be used directly as a lower bound of b+1, b+2 and b+3
    225 respectively. However, a value of 0 could mean that the actual lower
    226 bound is anything between 0 and b, so we cannot take b as a lower bound.
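In other words, decoding an entry works like this (sketch):

```python
def lower_bound_from_entry(entry, base):
    """entry is the 2-bit value read from the table (0..3). Values 1..3
    give the lower bound base + entry; 0 only means "at most base", so
    without a further (fallback) lookup the only safe lower bound is 0."""
    return base + entry if entry != 0 else 0
```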
    227 
    228 #### Fallback tables
    229 
    230 To be able to still use some sort of pruning value even when we get a
    231 0 read, we use a **fallback table**. Inspired by nxopt, this table is
    232 interleaved with the main table for cache efficiency: every 254 entries
    233 (508 bits), we store in 4 bits the minimum of the pruning values of these
254 entries as a number from 0 to 15, without subtracting the table's
    235 base. Since a cache line is 512 bits long, we never get a cache miss
    236 when looking up the minimum value in the fallback table after a 0 read.
Smaller lines (of 256 or 128 bits) have been tried, but they do not
give any significant improvement over 512-bit lines.
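The resulting bit addressing can be written down explicitly (a sketch of the layout just described):

```python
ENTRIES_PER_LINE = 254  # 254 entries * 2 bits = 508 bits per 512-bit line

def entry_bit_address(i):
    """Bit offset of the 2-bit entry i inside the interleaved table."""
    line, off = divmod(i, ENTRIES_PER_LINE)
    return 512 * line + 2 * off

def fallback_bit_address(i):
    """Bit offset of the 4-bit minimum for the line containing entry i:
    the last 4 bits of the same 512-bit cache line, so reading it after
    a 0 read never incurs an extra cache miss."""
    return 512 * (i // ENTRIES_PER_LINE) + 508
```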
    239 
This trick provides the greatest performance benefits if the main pruning
table is properly aligned. Unfortunately, as a design choice, the
solver is implemented here as a library that defers all memory allocation
business to the implementor. We do make sure that the table is properly
    244 aligned in the programs provided in this repository (for example, the
    245 rudimentary shell and the tools), but we do not enforce this in the
    246 main library code.
    247 
    248 #### Additional (fast) lookups
    249 
    250 Moreover, as an additional heuristic, in case of a 0 read we also look
    251 up another pruning value in a table that takes into account only the
    252 position of the edges. This table is small (around 1MB), so repeated
    253 accesses to it are not too slow. In practice, this gives a small speed up
    254 of around 5% for random scrambles. However, for small solvers (low values
    255 of h) and positions where corners are close to solved, this fallback
    256 table gives dramatic improvements (up to a factor of 1000x in some
    257 manual tests).
    258 
    259 More tables could be used to refine the fallback estimate, but each
    260 additional table leads to longer lookup times, especially if it is too
    261 large to fit in cache.  Previous versions of this implementation (up
    262 to commit 6c42463, or to version 0.2) also included the possibility of
    263 a table with 4 bits per entry, at least for the `h0` case. Such tables
did not require a fallback lookup, but due to their large size they
turned out to be no more efficient. Therefore, they have been removed.
    266 
    267 ### Estimation refinements
    268 
    269 After computing the pruning value, there are a number of different tricks
    270 that can be used to improve the estimation.
    271 
    272 #### Inverse estimate
    273 
    274 A cube position and its inverse will, in general, give a different
    275 pruning value. If the normal estimate is not enough to prune the branch,
    276 the inverse of the position can be computed and a new estimate can be
    277 obtained from there.
    278 
    279 #### Reducing the branching factor - tight bounds
    280 
*This trick and the next are an adaptation of a similar technique
    282 introduced by Rokicki in nxopt.*
    283 
    284 We can take advantage of the fact that the **h0** (and **h11**) coordinate
is invariant under the subgroup `<U2, D2, R2, L2, F2, B2>`. Suppose we
are looking for a solution of length `m`, that we are at depth `d`, and
that the pruning value `p` *for the inverse position* is a strict bound,
that is
    288 `d+p = m`.  In this situation, from the inverse point of view the solution
    289 cannot end with a 180° move: if this was the case, since these moves are
    290 contained in the target group for **h0** (or **h11**), it must be that
    291 we were in the target subgroup as well *before the last move*, i.e. we
    292 have found a solution for the **h0** coordinate of length `p-1`. But this
    293 is in contradiction with the fact that the inverse pruning value is `p`.
    294 
    295 From all of this, we conclude that trying any 180° move as next move is
    296 useless (because it would be the last move from the inverse position).
    297 We can therefore reduce the branching factor significantly and only try
    298 quarter-turn moves.
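In pseudo-Python, the resulting restriction on the move list reads (move labels are illustrative, with half turns ending in "2"):

```python
def candidate_moves(depth, target, inverse_prune, moves):
    """If the bound from the inverse position is strict
    (depth + inverse_prune == target), a half turn cannot be the next
    move, by the argument above: only quarter turns are tried."""
    if depth + inverse_prune == target:
        return [m for m in moves if not m.endswith("2")]
    return list(moves)
```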
    299 
    300 #### Reducing the branching factor - switching to inverse
    301 
    302 We can expand on the previous trick by using a technique similar to NISS.
    303 
    304 Suppose that we have a strict bound for the *normal position*. As above,
    305 we can deduce that the last move cannot be a 180° move, but this does not
    306 tell us anything about the possibilities for the next move. However, if
    307 we *switch to the inverse scramble* we can take advantage of this fact
    308 as described above. For doing this, we need to replace the cube with its
    309 inverse, and keep track of the moves done from now on so that we can
    310 invert them at the end to construct the final solution.
    311 
    312 When this technique is used, it is also possible to avoid some table
    313 lookups: if the last move applied to the cube is a 180° move *on the
    314 inverse position*, then the coordinate on the normal position has not
    315 changed. Thus if we keep track of the last computed pruning value,
    316 we can reuse it and avoid an expensive table lookup.
    317 
    318 ### Pruning pipeline and prefetching
    319 
    320 To improve the memory access pattern and exploit
    321 [prefetching](https://en.wikipedia.org/wiki/Cache_prefetching)
    322 opportunities, we don't expand each neighbor one by one in the
    323 pruning phase. Instead, we employ a *pruning pipeline*, where we
    324 first compute the cube and inverse position for each neighbor, and
    325 then we proceed through a series of stages, at each of which we
prune off some nodes and pre-compute the data
    327 necessary for the next step.
    328 
This pipeline-based strategy gives a speedup of around 25%.
    330 Adding manual prefetching, we get an additional 20% speedup.
    331 
    332 ### Other optimizations
    333 
    334 The H48 solver uses various other optimizations.
    335 
    336 #### Avoiding inverse cube computation
    337 
Computing the inverse of a cube position is expensive. We avoid doing
that (for the inverse pruning value estimate) by bringing along both
the normal and the inverse cube during the search and using *premoves*
to apply moves to the inverse scramble.
    342 
    343 #### Multi-threading
    344 
The solution search is parallelized *per scramble*. Although when solving
multiple positions it would be more efficient to solve them in parallel
by dedicating a single thread to each of them, the current H48 solver
optimizes for solving a single cube at a time.
    349 
    350 To do this, we first compute all cube positions that are 4 moves away
    351 from the scramble. This way we prepare up to 43254 *tasks* (depending
    352 on the symmetries of the starting position, which we take advantage of)
    353 which are equally split between the available threads.  Any solution
    354 encountered in this step is of course added to the list of solutions.
    355 
    356 #### Filtering out symmetric cases
    357 
    358 If the cube position we wish to solve is, say, mirrored along the RL
    359 plane, it is redundant to attempt starting a solution with both R and L',
    360 as they would lead to mirrored versions of the same solution. Therefore,
    361 we filter out some tasks by looking for symmetries before each of the
    362 first 4 moves. Then, unless we are looking for only one solution, we
    363 have to make sure to report back all symmetric variations of a solution,
    364 so we need to keep track of which transformations were available at each
    365 of these 4 points.
    366 
    367 This can reduce the number of starting positions drastically: for
    368 example, for the superflip our method produces 1038 starting tasks -
a 41.67x improvement! Of course this optimization has no effect on
positions that are more than 3 moves away from any symmetric position -
which is the vast majority of them.
    372 
    373 #### Heuristically sorting tasks
    374 
    375 The tasks described in the previous paragraph (multi-threading) are
    376 initially searched in an arbitrary order. However, after searching
    377 at a sufficient depth, we have gathered some data that allows us to
make some heuristic improvements: the tasks that lead to visiting
more positions (in other words, those where we exceed the estimated lower
bounds less often) are more likely to yield the optimal solution. Thus
    381 we sort the tasks based on this, adjusting by a small factor due to the
    382 fact that sequences ending in U, R or F moves have more continuations
    383 than those ending in D, L or B moves - as we don't allow, for example,
    384 both U D and D U, but only the former.
    385 
    386 Preliminary benchmarks show a performance improvement of around 40%
    387 when searching a single solution. When searching for multiple optimal
solutions the effects of this optimization will be less pronounced, and
they are obviously nonexistent when looking for *all* optimal solutions.
    390 
    391 ## Pruning table computation
    392 
    393 We first describe how the pruning table would be computed if we stored
    394 each pruning value fully, using 4 bits per entry. This algorithm used
    395 to be implemented (up to commit 6c42463), but has since been removed.
    396 
    397 ### Full tables for h0 and h11 (not implemented)
    398 
    399 If we are allowed to store the full pruning value for each entry,
    400 computing the pruning tables is not that difficult.  First, we set the
    401 value of the solved cube's coordinate to 0 and every other entry to 15,
    402 the largest 4-bit integer. Then we iteratively scan through the table and
    403 compute the coordinates at depth n+1 from those at depth n as follows:
    404 for each coordinate at depth n, we compute a valid representative for it,
we apply each of the possible 18 moves to it, and we set the value of
the resulting coordinates to n+1, if not already set. Once the table is
filled or we have reached depth 15, we stop.
    407 
Unfortunately what I described above is an oversimplification: one
must also take into account the case where the corner coordinate is
    410 self-symmetric, and handle it accordingly by making sure that every
    411 symmetric variation of the same coordinate has its value set at the
    412 right iteration.
    413 
When this algorithm reaches the last few iterations, the coordinates at
depth n are many more than those whose depth is still unknown. Therefore
it is convenient to look at the unset coordinates and check if they have
any neighbor whose depth is n; if so, we can set such a coordinate to n+1.
    418 Thanks to [torchlight](https://github.com/torchlight) for suggesting
    419 this optimization.
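A sketch of this late-iteration pass, on an abstract coordinate space where `neighbors(c)` stands in for applying the 18 moves:

```python
def backward_pass(table, neighbors, depth, unset=255):
    """Instead of expanding the many coordinates already at `depth`,
    scan the few still-unset ones: any unset coordinate with a neighbor
    at `depth` is at depth + 1. Returns the number of entries set."""
    changed = 0
    for c in range(len(table)):
        if table[c] == unset and any(table[n] == depth for n in neighbors(c)):
            table[c] = depth + 1
            changed += 1
    return changed
```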
    420 
    421 Finally, this algorithm can be easily parallelized by dividing the set
    422 of coordinates into separate sections, but one must take into account
    423 that a coordinate and its neighbors are usually not in the same section.
    424 
    425 ### 2 bit tables
    426 
    427 Computing the pruning tables for intermediate coordinates is not as
    428 simple.  The reason is that these coordinates are not invariant under
    429 the 48 transformations of the cube, but we are treating them as such.
    430 Take for example the case of **h10**. Each **h10** coordinate is
    431 represented by either of two **h11** coordinates: the one where the 11th
    432 edge is correctly oriented and the one where this edge is flipped (of
    433 course, the orientation of the 12th edge can be deduced from the parity
    434 of the orientation of the other 11). We can take any cube whose **h11**
    435 coordinate is one of these two to represent our **h10** coordinate, but
    436 the set of neighbors will change depending on which representative we
    437 picked. In other words: if I take one of the two cubes and make a move,
    438 the position reached is not necessarily obtainable by applying a move
    439 to the other cube. This means that the same algorithm that we described
    440 for the "real coordinate" case cannot be applied here.
    441 
It is still possible to do a brute-force depth-first search to fill
these tables. To make it reasonably fast, we apply a couple of
optimizations. First of all, we restrict ourselves to computing a 2-bit
table, so we can stop the search early.
    446 
    447 The second optimization we employ consists in pre-computing all possible
    448 **h11** coordinates at a fixed depth, and storing their depth in a hash
    449 map indexed by their **h11** coordinate value. This way we do not have
to brute-force our way through from depth 0, and it makes this method
considerably faster. Unfortunately, since the number of coordinates
at a certain depth increases exponentially with the depth, we are for
now limited to storing the coordinates at depth 8.
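The idea can be sketched on a toy coordinate space (a counter modulo 12; `DEPTH_MAP` plays the role of the hash map, here holding every coordinate at depth at most 1; all names and the move set are illustrative):

```python
import math

N = 12
MOVES = (1, -1, 6)
DEPTH_MAP = {0: 0, 1: 1, 11: 1, 6: 1}  # precomputed exact depths

def estimate_depth(pos, max_extra):
    """Search away from `pos`; whenever we reach a coordinate stored in
    the map, moves-made + stored-depth bounds the distance of `pos`.
    In this simplified sketch the result is an upper bound, exact as
    long as `max_extra` is large enough."""
    best = math.inf

    def rec(p, d):
        nonlocal best
        if p in DEPTH_MAP:
            best = min(best, d + DEPTH_MAP[p])
        if d < max_extra:
            for m in MOVES:
                rec((p + m) % N, d + 1)

    rec(pos, 0)
    return best
```

In the actual implementation the map stores **h11** coordinates at depth 8, and the search runs over cube positions rather than raw coordinates.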
    454 
    455 This method works for the "real" coordinates **h0** and **h11** as well.
Moreover, in this case one can optimize it further by avoiding repeating
the search from a coordinate that has already been visited. Further
optimizations are possible for **h0** and **h11**, and we may implement
    459 them in the future.