Improve multithreading benchmarks (compare speedup factor, not with vcube) - nissy-core - The "engine" of nissy, including the H48 optimal solver.

commit 233169d9ab5fb8a796d91a3694f8a522fa54d194
parent 74299c8919cb2e2f6e1678817529e3a730cf6b81
Author: Sebastiano Tronto <sebastiano@tronto.net>
Date:   Mon, 26 Jan 2026 22:52:15 +0100

Improve multithreading benchmarks (compare speedup factor, not with vcube)

Diffstat:
M benchmarks/benchmarks.md  | 159 ++++++++++++++++++++++++++++++++++++-------------------------------------------
A benchmarks/img/16threads.png  | 0 
D benchmarks/img/17moves16threads.png  | 0 
M benchmarks/img/17moves1thread.png  | 0 
D benchmarks/img/17moves4threads.png  | 0 
D benchmarks/img/18moves16threads.png  | 0 
M benchmarks/img/18moves1thread.png  | 0 
D benchmarks/img/18moves4threads.png  | 0 
D benchmarks/img/19moves16threads.png  | 0 
M benchmarks/img/19moves1thread.png  | 0 
D benchmarks/img/19moves4threads.png  | 0 
D benchmarks/img/20moves16threads.png  | 0 
M benchmarks/img/20moves1thread.png  | 0 
D benchmarks/img/20moves4threads.png  | 0 
A benchmarks/img/4threads.png  | 0 
M benchmarks/plot-benchmarks.py  | 226 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
D benchmarks/results_h48_clang.py  | 36 ------------------------------------
A benchmarks/tables_16_threads.md  | 27 +++++++++++++++++++++++++++
A benchmarks/tables_1_thread.md  | 40 ++++++++++++++++++++++++++++++++++++++++
A benchmarks/tables_4_threads.md  | 27 +++++++++++++++++++++++++++
A benchmarks/tables_all_solutions.md  | 26 ++++++++++++++++++++++++++

21 files changed, 316 insertions(+), 225 deletions(-)
diff --git a/benchmarks/benchmarks.md b/benchmarks/benchmarks.md
@@ -38,21 +38,6 @@ may be particularly inaccurate, because writing the solution (together
 with some log messages) to standard output may take a significant portion
 of the time, depending on the terminal emulator used and other factors.
 
-The main test we performed was finding a single optimal solution, and we
-compared the results with vcube. This test was run in a single-thread
-configuration and in two multithread configurations (with 4 and 16
-threads).  We also ran a test on finding *all* optimal solutions which,
-as far as I know, is a use case not supported by vcube; this latter test
-was only run on 16 threads.
-
-Since the size of the pruning table used by the solver is of utmost
-importance, we include two statistics: time per cube and time per cube
-*adjusted by table size*.  The adjustment we took simply consists in
-multiplying the time per cube by the size of the table. Empirically
-this leads to more even results across the board, as the speed of the
-solvers of the same family seems to scale linearly with the size of the
-pruning table.
-
 All benchmark tests were done on the following configuration:
 
 * CPU: AMD Ryzen 7 7700 (8 cores, 16 virtual threads)
@@ -61,9 +46,29 @@ All benchmark tests were done on the following configuration:
 * Operating system: Debian 13 (Linux kernel 6.12.57)
 * Compiler: GCC 14.2.0 for H48 and Clang 19.1.7 for vcube
 
-## Results (click on each item to expand)
+## Benchmark results
+
+### Single solution, single thread
+
+The first test we performed was finding a single optimal solution, and
+we compared the results with vcube.
 
-<details><summary>Single solution, single thread</summary>
+This test was run in a single-thread configuration, because the two
+solvers use very different strategies for multithreading: vcube can only
+parallelize by solving one cube per thread, while H48 is able to take
+advantage of multiple threads even when solving a single cube. Therefore
+we prefer to compare their single-threaded performance only.
+
+Since the size of the pruning table used by the solver is of utmost
+importance, we include two statistics: time per cube and time per cube
+*adjusted by table size*.  The adjustment we took simply consists in
+multiplying the time per cube by the size of the table. Empirically this
+leads to more even results across the board, although the speed of the
+solvers of the same family does not scale exactly linearly with the size
+of the pruning table.
+
+<!-- The following details block can be found in benchmarks/single_thread_comparison.md -->
+<details><summary>Results: Single solution, single thread</summary>
 
 Time per cube (in seconds, lower is better).
 
@@ -103,95 +108,92 @@ Time per cube adjusted for table size (in seconds \* GiB, lower is better).
 <img src="img/20moves1thread.png">
 </details>
 
-<details><summary>Single solution, 4 threads</summary>
+As we can see, adjusting for table size, H48 is generally faster than
+vcube.  The gap between the two solvers is larger for scrambles with
+longer optimal solutions.
+
+### Single solution, multiple threads
+
+The same benchmark as before is repeated using 4 and 16 threads (recall
+that the CPU used for these tests has 8 physical cores and 16 virtual
+threads).
+
+As mentioned above, we don't compare these results to vcube. Instead,
+we compare them with the single-threaded results for H48 and we show
+how far the speedup factor is from a theoretically optimal 4x and 16x.
+
+<!-- The following details block can be found in benchmarks/tables_4_threads.md -->
+<details><summary>Results: Single solution, 4 threads</summary>
 
 Time per cube (in seconds, lower is better).
 
 |  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
 |:---------|:-------|-------:|-------:|-------:|-------:|--------:|
-|vcube 212 |58.2 GiB|    0.03|    0.27|    1.04|    7.70|   (a)   |
 |H48 h11   |56.5 GiB|    0.04|    0.14|    0.60|    3.97|     4.19|
-|vcube 404 |31.8 GiB|    0.07|    0.30|    1.65|   16.17|   (a)   |
 |H48 h10   |28.3 GiB|    0.05|    0.20|    0.91|    6.65|     9.81|
-|vcube 308 |21.2 GiB|    0.05|    0.35|    1.78|   16.61|   (a)   |
 |H48 h9    |14.1 GiB|    0.07|    0.39|    1.74|   12.54|    18.96|
-|vcube 208 | 7.3 GiB|    0.16|    1.47|    5.86|        |   (a)   |
 |H48 h8    | 7.1 GiB|    0.13|    0.90|    3.71|        |         |
 |H48 h7    | 3.5 GiB|    0.17|    1.28|    6.12|        |         |
-|vcube 112 | 2.4 GiB|    0.29|    3.13|   11.95|        |   (a)   |
 |H48 h6    | 1.8 GiB|    0.33|    2.47|   12.22|        |         |
 
-Time per cube adjusted for table size (in seconds \* GiB, lower is better).
+Speed-up factor (higher is better).
 
 |  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
 |:---------|:-------|-------:|-------:|-------:|-------:|--------:|
-|vcube 212 |58.2 GiB|    2.03|   15.69|   60.25|  447.97|   (a)   |
-|H48 h11   |56.5 GiB|    2.04|    8.17|   33.77|  224.59|   236.92|
-|vcube 404 |31.8 GiB|    2.32|    9.50|   52.46|  514.50|   (a)   |
-|H48 h10   |28.3 GiB|    1.37|    5.76|   25.83|  187.88|   277.18|
-|vcube 308 |21.2 GiB|    1.02|    7.52|   37.82|  352.36|   (a)   |
-|H48 h9    |14.1 GiB|    1.03|    5.45|   24.58|  177.17|   267.79|
-|vcube 208 | 7.3 GiB|    1.18|   10.69|   42.63|        |   (a)   |
-|H48 h8    | 7.1 GiB|    0.92|    6.39|   26.20|        |         |
-|H48 h7    | 3.5 GiB|    0.62|    4.53|   21.62|        |         |
-|vcube 112 | 2.4 GiB|    0.69|    7.59|   28.97|        |   (a)   |
-|H48 h6    | 1.8 GiB|    0.59|    4.37|   21.61|        |         |
-
-(a) vcube cannot parallelize on a single scramble, the results for the
-superflip are going to be the same as in the single thread case.
-
-<img src="img/17moves4threads.png">
-<img src="img/18moves4threads.png">
-<img src="img/19moves4threads.png">
-<img src="img/20moves4threads.png">
+|H48 h11   |56.5 GiB|    2.55|    3.46|    3.75|    3.96|     3.71|
+|H48 h10   |28.3 GiB|    3.09|    3.71|    3.67|    3.54|     3.88|
+|H48 h9    |14.1 GiB|    3.25|    3.84|    3.85|        |         |
+|H48 h8    | 7.1 GiB|    3.53|    3.72|    3.81|        |         |
+|H48 h7    | 3.5 GiB|    3.60|    3.78|    3.80|        |         |
+|H48 h6    | 1.8 GiB|    3.77|    3.82|    3.79|        |         |
+<img src="img/4threads.png">
+
 </details>
 
-<details><summary>Single solution, 16 threads</summary>
+<!-- The following details block can be found in benchmarks/tables_16_threads.md -->
+<details><summary>Results: Single solution, 16 threads</summary>
 
 Time per cube (in seconds, lower is better).
 
 |  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
 |:---------|:-------|-------:|-------:|-------:|-------:|--------:|
-|vcube 212 |58.2 GiB|    0.02|    0.13|    0.45|    2.84|   (a)   |
 |H48 h11   |56.5 GiB|    0.02|    0.06|    0.22|    1.33|     1.84|
-|vcube 404 |31.8 GiB|    0.04|    0.14|    0.65|    6.08|   (a)   |
 |H48 h10   |28.3 GiB|    0.03|    0.08|    0.33|    2.34|     4.18|
-|vcube 308 |21.2 GiB|    0.03|    0.19|    0.78|    6.67|   (a)   |
 |H48 h9    |14.1 GiB|    0.04|    0.15|    0.64|    4.45|     8.09|
-|vcube 208 | 7.3 GiB|    0.08|    0.79|    2.43|        |   (a)   |
 |H48 h8    | 7.1 GiB|    0.06|    0.34|    1.36|        |         |
 |H48 h7    | 3.5 GiB|    0.07|    0.47|    2.20|        |         |
-|vcube 112 | 2.4 GiB|    0.15|    1.63|    5.10|        |   (a)   |
 |H48 h6    | 1.8 GiB|    0.13|    0.91|    4.39|        |         |
 
-Time per cube adjusted for table size (in seconds \* GiB, lower is better).
+Speed-up factor (higher is better).
 
 |  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
 |:---------|:-------|-------:|-------:|-------:|-------:|--------:|
-|vcube 212 |58.2 GiB|    0.95|    7.83|   26.04|  165.03|   (a)   |
-|H48 h11   |56.5 GiB|    1.34|    3.48|   12.19|   74.93|   103.98|
-|vcube 404 |31.8 GiB|    1.21|    4.60|   20.76|  193.43|   (a)   |
-|H48 h10   |28.3 GiB|    0.74|    2.37|    9.46|   66.13|   118.23|
-|vcube 308 |21.2 GiB|    0.67|    4.01|   16.48|  141.49|   (a)   |
-|H48 h9    |14.1 GiB|    0.51|    2.15|    8.98|   62.92|   114.31|
-|vcube 208 | 7.3 GiB|    0.56|    5.78|   17.68|        |   (a)   |
-|H48 h8    | 7.1 GiB|    0.40|    2.40|    9.59|        |         |
-|H48 h7    | 3.5 GiB|    0.26|    1.66|    7.77|        |         |
-|vcube 112 | 2.4 GiB|    0.35|    3.95|   12.37|        |   (a)   |
-|H48 h6    | 1.8 GiB|    0.23|    1.61|    7.76|        |         |
-
-(a) vcube cannot parallelize on a single scramble, the results for the
-superflip are going to be the same as in the single thread case.
-
-<img src="img/17moves16threads.png">
-<img src="img/18moves16threads.png">
-<img src="img/19moves16threads.png">
-<img src="img/20moves16threads.png">
+|H48 h11   |56.5 GiB|    3.90|    8.14|   10.39|   11.86|     8.45|
+|H48 h10   |28.3 GiB|    5.72|    9.00|   10.03|   10.04|     9.09|
+|H48 h9    |14.1 GiB|    6.56|    9.75|   10.53|        |         |
+|H48 h8    | 7.1 GiB|    8.08|    9.91|   10.40|        |         |
+|H48 h7    | 3.5 GiB|    8.59|   10.30|   10.57|        |         |
+|H48 h6    | 1.8 GiB|    9.60|   10.39|   10.54|        |         |
+<img src="img/16threads.png">
+
 </details>
 
-<details><summary>All solutions, 16 threads</summary>
+We can see that H48 scales pretty well with 4 threads, getting close
+to the 4x theoretical maximum speedup in slower cases (small table or
+long solutions).
+
+In the 16 threads benchmark shows that, although the virtual threads
+help push us beyond the 8x theoretical speedup that would be provided
+by the 8 cores, we are nowhere near a 16x speedup.
+
+### All solutions
 
-*Note: vcube does not have an option for finding multiple solutions.*
+Finally, we ran a test on finding *all* optimal solutions which, as
+far as I know, is a use case not supported by vcube. For convenience,
+this test is only run on 16 threads.
+
+<!-- The following details block can be found in benchmarks/all_solutions.md -->
+<details><summary>Results: All solutions (16 threads)</summary>
 
 Time per cube (in seconds, lower is better).
 
@@ -217,16 +219,6 @@ Time per cube adjusted for table size (in seconds \* GiB, lower is better).
 
 </details>
 
-## Comments on the results
-
-* Adjusting for table size, H48 is generally faster than vcube.
-* The gap between the two solvers is larger for scrambles with
-  longer optimal solutions.
-* On sets of 25 scrambles, H48 performs better than vcube when using
-  multiple threads, compared to their single-threaded performance.
-  However, this advantage will likely disappear (or
-  even invert) if we increase the size of the set.
-
 ## Other notes
 
 * To repeat the benchmarks, use `./benchmarks/run-h48-benchmarks.sh`.
@@ -249,8 +241,3 @@ Time per cube adjusted for table size (in seconds \* GiB, lower is better).
 * For H48, both GCC and Clang have been tried, with the same options;
   the resulting executable was about 10% faster with GCC compared to Clang.
   vcube only supports compiling with Clang.
-* The performance of the H48 solver depends slightly, but measurably, on the
-  alignment of the pruning table in memory. This is not handled by the core
-  library, but by the program that uses it. For these tests, we have used as
-  reference implementation the program in `tools/301_solve_file`, which
-  ensures 64-byte alignment.
diff --git a/benchmarks/img/16threads.png b/benchmarks/img/16threads.png
Binary files differ.
diff --git a/benchmarks/img/17moves16threads.png b/benchmarks/img/17moves16threads.png
Binary files differ.
diff --git a/benchmarks/img/17moves1thread.png b/benchmarks/img/17moves1thread.png
Binary files differ.
diff --git a/benchmarks/img/17moves4threads.png b/benchmarks/img/17moves4threads.png
Binary files differ.
diff --git a/benchmarks/img/18moves16threads.png b/benchmarks/img/18moves16threads.png
Binary files differ.
diff --git a/benchmarks/img/18moves1thread.png b/benchmarks/img/18moves1thread.png
Binary files differ.
diff --git a/benchmarks/img/18moves4threads.png b/benchmarks/img/18moves4threads.png
Binary files differ.
diff --git a/benchmarks/img/19moves16threads.png b/benchmarks/img/19moves16threads.png
Binary files differ.
diff --git a/benchmarks/img/19moves1thread.png b/benchmarks/img/19moves1thread.png
Binary files differ.
diff --git a/benchmarks/img/19moves4threads.png b/benchmarks/img/19moves4threads.png
Binary files differ.
diff --git a/benchmarks/img/20moves16threads.png b/benchmarks/img/20moves16threads.png
Binary files differ.
diff --git a/benchmarks/img/20moves1thread.png b/benchmarks/img/20moves1thread.png
Binary files differ.
diff --git a/benchmarks/img/20moves4threads.png b/benchmarks/img/20moves4threads.png
Binary files differ.
diff --git a/benchmarks/img/4threads.png b/benchmarks/img/4threads.png
Binary files differ.
diff --git a/benchmarks/plot-benchmarks.py b/benchmarks/plot-benchmarks.py
@@ -1,6 +1,14 @@
 import matplotlib.pyplot as plt
 import results_h48
 import results_vcube
+from pathlib import Path
+
+benchmarks_dir = Path("benchmarks")
+benchmarks_img_dir = benchmarks_dir / "img"
+benchmarks_single_thread = benchmarks_dir / "tables_1_thread.md"
+benchmarks_4_threads = benchmarks_dir / "tables_4_threads.md"
+benchmarks_16_threads = benchmarks_dir / "tables_16_threads.md"
+benchmarks_all_solutions = benchmarks_dir / "tables_all_solutions.md"
 
 # Table sizes in bytes
 
@@ -23,7 +31,7 @@ sizes_vcube = {
 
 # Printing tables in markdown format
 
-def print_row(solver_name, solver_size, dict, mul_by_size, superflip_star):
+def print_row(f, solver_name, solver_size, dict, mul_by_size, superflip_star):
 	if dict is None:
 		return
 	solver_gib = solver_size / (2**30)
@@ -36,104 +44,106 @@ def print_row(solver_name, solver_size, dict, mul_by_size, superflip_star):
 		f"{dict["superflip"]*m:>9.2f}" if "superflip" in dict else s
 	]
 	sep = "|"
-	print(sep + sep.join(cols) + sep)
+	f.write(sep + sep.join(cols) + sep + "\n")
 
-def print_table(h48, vcube, ms, st):
+def print_table(f, h48, vcube, ms, st):
 	vcube = vcube or {}
-	print("|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|")
-	print("|:---------|:-------|-------:|-------:|-------:|-------:|--------:|")
-	print_row("vcube 212", sizes_vcube[212], vcube.get(212), ms, not st)
-	print_row("H48 h11", sizes_h48[11], h48[11], ms, False)
-	print_row("vcube 404", sizes_vcube[404], vcube.get(404), ms, not st)
-	print_row("H48 h10", sizes_h48[10], h48[10], ms, False)
-	print_row("vcube 308", sizes_vcube[308], vcube.get(308), ms, not st)
-	print_row("H48 h9", sizes_h48[9], h48[9], ms, False)
-	print_row("vcube 208", sizes_vcube[208], vcube.get(208), ms, not st)
-	print_row("H48 h8", sizes_h48[8], h48[8], ms, False)
-	print_row("H48 h7", sizes_h48[7], h48[7], ms, False)
-	print_row("vcube 112", sizes_vcube[112], vcube.get(112), ms, not st)
-	print_row("H48 h6", sizes_h48[6], h48[6], ms, False)
-
-print("<details><summary>Single solution, single thread</summary>")
-print()
-print("Time per cube (in seconds, lower is better).")
-print()
-print_table(results_h48.h48_single_thread, results_vcube.vcube_single_thread, False, True)
-print()
-print("Time per cube adjusted for table size (in seconds \\* GiB, lower is better).")
-print()
-print_table(results_h48.h48_single_thread, results_vcube.vcube_single_thread, True, True)
-print()
-print(
-	'<img src="img/17moves1thread.png">\n'
-	'<img src="img/18moves1thread.png">\n'
-	'<img src="img/19moves1thread.png">\n'
-	'<img src="img/20moves1thread.png">'
-)
-print("</details>")
-
-print()
-print("<details><summary>Single solution, 4 threads</summary>")
-print()
-print("Time per cube (in seconds, lower is better).")
-print()
-print_table(results_h48.h48_4_threads, results_vcube.vcube_4_threads, False, False)
-print()
-print("Time per cube adjusted for table size (in seconds \\* GiB, lower is better).")
-print()
-print_table(results_h48.h48_4_threads, results_vcube.vcube_4_threads, True, False)
-print()
-print("(a) vcube cannot parallelize on a single scramble, the results for the")
-print("superflip are going to be the same as in the single thread case.")
-print()
-print(
-	'<img src="img/17moves4threads.png">\n'
-	'<img src="img/18moves4threads.png">\n'
-	'<img src="img/19moves4threads.png">\n'
-	'<img src="img/20moves4threads.png">'
-)
-print("</details>")
-
-print()
-print("<details><summary>Single solution, 16 threads</summary>")
-print()
-print("Time per cube (in seconds, lower is better).")
-print()
-print_table(results_h48.h48_16_threads, results_vcube.vcube_16_threads, False, False)
-print()
-print("Time per cube adjusted for table size (in seconds \\* GiB, lower is better).")
-print()
-print_table(results_h48.h48_16_threads, results_vcube.vcube_16_threads, True, False)
-print()
-print("(a) vcube cannot parallelize on a single scramble, the results for the")
-print("superflip are going to be the same as in the single thread case.")
-print()
-print(
-	'<img src="img/17moves16threads.png">\n'
-	'<img src="img/18moves16threads.png">\n'
-	'<img src="img/19moves16threads.png">\n'
-	'<img src="img/20moves16threads.png">'
-)
-print("</details>")
-
-print()
-print("<details><summary>All solutions, 16 threads</summary>")
-print()
-print("*Note: vcube does not have an option for finding multiple solutions.*")
-print()
-print("Time per cube (in seconds, lower is better).")
-print()
-print_table(results_h48.h48_all_solutions, None, False, False)
-print()
-print("Time per cube adjusted for table size (in seconds \\* GiB, lower is better).")
-print()
-print_table(results_h48.h48_all_solutions, None, True, False)
-print()
-print("</details>")
+	f.write("|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|\n")
+	f.write("|:---------|:-------|-------:|-------:|-------:|-------:|--------:|\n")
+	print_row(f, "vcube 212", sizes_vcube[212], vcube.get(212), ms, not st)
+	print_row(f, "H48 h11", sizes_h48[11], h48[11], ms, False)
+	print_row(f, "vcube 404", sizes_vcube[404], vcube.get(404), ms, not st)
+	print_row(f, "H48 h10", sizes_h48[10], h48[10], ms, False)
+	print_row(f, "vcube 308", sizes_vcube[308], vcube.get(308), ms, not st)
+	print_row(f, "H48 h9", sizes_h48[9], h48[9], ms, False)
+	print_row(f, "vcube 208", sizes_vcube[208], vcube.get(208), ms, not st)
+	print_row(f, "H48 h8", sizes_h48[8], h48[8], ms, False)
+	print_row(f, "H48 h7", sizes_h48[7], h48[7], ms, False)
+	print_row(f, "vcube 112", sizes_vcube[112], vcube.get(112), ms, not st)
+	print_row(f, "H48 h6", sizes_h48[6], h48[6], ms, False)
+
+def print_factor_table(f, slow, fast):
+	ratio = {}
+	for m in [6, 7, 8, 9, 10, 11]:
+		ratio[m] = {}
+		for k in slow[m]:
+			ratio[m][k] = (slow[m][k] / fast[m][k]) * (25 if k != "superflip" else 1)
+	f.write("|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|\n")
+	f.write("|:---------|:-------|-------:|-------:|-------:|-------:|--------:|\n")
+	print_row(f, "H48 h11", sizes_h48[11], ratio[11], False, False)
+	print_row(f, "H48 h10", sizes_h48[10], ratio[10], False, False)
+	print_row(f, "H48 h9", sizes_h48[9], ratio[9], False, False)
+	print_row(f, "H48 h8", sizes_h48[8], ratio[8], False, False)
+	print_row(f, "H48 h7", sizes_h48[7], ratio[7], False, False)
+	print_row(f, "H48 h6", sizes_h48[6], ratio[6], False, False)
+
+with open(benchmarks_single_thread, "w") as f:
+	f.write(f"<!-- The following details block can be found in {benchmarks_single_thread} -->\n") 
+	f.write("<details><summary>Results: Single solution, single thread</summary>\n")
+	f.write("\n")
+	f.write("Time per cube (in seconds, lower is better).\n")
+	f.write("\n")
+	print_table(f, results_h48.h48_single_thread, results_vcube.vcube_single_thread, False, True)
+	f.write("\n")
+	f.write("Time per cube adjusted for table size (in seconds \\* GiB, lower is better).\n")
+	f.write("\n")
+	print_table(f, results_h48.h48_single_thread, results_vcube.vcube_single_thread, True, True)
+	f.write("\n")
+	f.write(
+		'<img src="img/17moves1thread.png">\n'
+		'<img src="img/18moves1thread.png">\n'
+		'<img src="img/19moves1thread.png">\n'
+		'<img src="img/20moves1thread.png">\n'
+	)
+	f.write("</details>\n")
+
+with open(benchmarks_4_threads, "w") as f:
+	f.write(f"<!-- The following details block can be found in {benchmarks_4_threads} -->\n") 
+	f.write("<details><summary>Results: Single solution, 4 threads</summary>\n")
+	f.write("\n")
+	f.write("Time per cube (in seconds, lower is better).\n")
+	f.write("\n")
+	print_table(f, results_h48.h48_4_threads, None, False, False)
+	f.write("\n")
+	f.write("Speed-up factor (higher is better).\n")
+	f.write("\n")
+	print_factor_table(f, results_h48.h48_single_thread, results_h48.h48_4_threads)
+	f.write('<img src="img/4threadsspeedupfactor.png">\n')
+	f.write("\n")
+	f.write("</details>\n")
+
+with open(benchmarks_16_threads, "w") as f:
+	f.write(f"<!-- The following details block can be found in {benchmarks_16_threads} -->\n") 
+	f.write("<details><summary>Results: Single solution, 16 threads</summary>\n")
+	f.write("\n")
+	f.write("Time per cube (in seconds, lower is better).\n")
+	f.write("\n")
+	print_table(f, results_h48.h48_16_threads, None, False, False)
+	f.write("\n")
+	f.write("Speed-up factor (higher is better).\n")
+	f.write("\n")
+	print_factor_table(f, results_h48.h48_single_thread, results_h48.h48_16_threads)
+	f.write('<img src="img/16threadsspeedupfactor.png">\n')
+	f.write("\n")
+	f.write("</details>\n")
+
+with open(benchmarks_all_solutions, "w") as f:
+	f.write(f"<!-- The following details block can be found in {benchmarks_all_solutions} -->\n") 
+	f.write("<details><summary>Results: All solutions, 16 threads</summary>\n")
+	f.write("\n")
+	f.write("Time per cube (in seconds, lower is better).\n")
+	f.write("\n")
+	print_table(f, results_h48.h48_all_solutions, None, False, False)
+	f.write("\n")
+	f.write("Time per cube adjusted for table size (in seconds \\* GiB, lower is better).\n")
+	f.write("\n")
+	print_table(f, results_h48.h48_all_solutions, None, True, False)
+	f.write("\n")
+	f.write("</details>\n")
 
 # Plotting
 
-def plot(title, hd, vd, key):
+def plot_comparison(title, hd, vd, key):
 	d = 1 if key == "superflip" else 25
 	h48x = [sizes_h48[m]/(2**30) for m in hd.keys() if key in hd[m]]
 	vcubex = [sizes_vcube[m]/(2**30) for m in vd.keys() if key in vd[m]]
@@ -148,17 +158,27 @@ def plot(title, hd, vd, key):
 	plt.plot(vcubex, vcubey, "o--", label = "vcube")
 	plt.legend(loc = "right")
 	filename = title.replace(" ", "").replace(",", "") + ".png"
-	plt.savefig("benchmarks/img/" + filename, dpi=300)
+	plt.savefig(benchmarks_img_dir / filename, dpi=300)
 	#plt.show()
 
 rh, rv = results_h48.h48_single_thread, results_vcube.vcube_single_thread
 for m in [17, 18, 19, 20]:
-	plot(f"{m} moves 1 thread", rh, rv, m)
-
-rh, rv = results_h48.h48_4_threads, results_vcube.vcube_4_threads
-for m in [17, 18, 19, 20]:
-	plot(f"{m} moves 4 threads", rh, rv, m)
+	plot_comparison(f"{m} moves 1 thread", rh, rv, m)
 
-rh, rv = results_h48.h48_16_threads, results_vcube.vcube_16_threads
-for m in [17, 18, 19, 20]:
-	plot(f"{m} moves 16 threads", rh, rv, m)
+def plot_multithread_scatter(title, slow, fast):
+	plt.clf()
+	plt.title(title)
+	plt.xlabel("Moves")
+	plt.ylabel("Speed-up factor")
+	x = [17, 18, 19]
+	plt.xticks(x)
+	for h in [11, 10, 9, 8, 7, 6]:
+		y = [slow[h][i] / fast[h][i] for i in x]
+		plt.scatter(x, y, label=f"H48 h{h}")
+		plt.legend(loc = "right")
+		filename = title.replace(" ", "").replace(",", "") + ".png"
+		plt.savefig(benchmarks_img_dir / filename, dpi=300)
+	plt.show()
+
+plot_multithread_scatter("4 threads speedup factor", results_h48.h48_single_thread, results_h48.h48_4_threads)
+plot_multithread_scatter("16 threads speedup factor", results_h48.h48_single_thread, results_h48.h48_16_threads)
diff --git a/benchmarks/results_h48_clang.py b/benchmarks/results_h48_clang.py
@@ -1,36 +0,0 @@
-h48_single_thread = {
-	6: {17: 57.7296, 18: 436.0622, 19: 2240.5087},
-	7: {17: 27.8748, 18: 222.3812, 19: 1114.8537},
-	8: {17: 23.6884, 18: 181.7169, 19: 722.1819},
-	9: {17: 10.1115, 18: 72.6935, 19: 328.5458},
-	10: {17: 7.0337, 18: 37.5239, 19: 179.3647, 20: 1267.3059, "superflip": 71.8359},
-	11: {17: 3.9693, 18: 24.2573, 19: 106.8995, 20: 688.5594, "superflip": 32.0008},
-}
-
-h48_4_threads = {
-	6: {17: 16.7353, 18: 129.9696, 19: 663.2930},
-	7: {17: 8.4315, 18: 65.4883, 19: 348.3440},
-	8: {17: 6.9127, 18: 51.8117, 19: 208.8438},
-	9: {17: 3.0985, 18: 21.2920, 19: 95.3467, 20: 711.2927, "superflip": 35.3772},
-	10: {17: 2.0592, 18: 10.6636, 19: 49.9811, 20: 388.0203, "superflip": 18.6665},
-	11: {17: 1.3065, 18: 6.9261, 19: 29.7725, 20: 197.9893, "superflip": 9.1158},
-}
-
-h48_16_threads = {
-	6: {17: 5.6335, 18: 39.2955, 19: 207.5730},
-	7: {17: 2.9328, 18: 21.0462, 19: 104.7917},
-	8: {17: 2.4790, 18: 17.0854, 19: 67.6617},
-	9: {17: 1.1561, 18: 7.0899, 19: 30.5952, 20: 220.9804, "superflip": 13.5599},
-	10: {17: 0.8002, 18: 3.6835, 19: 16.4910, 20: 114.4733, "superflip": 7.6359},
-	11: {17: 0.5475, 18: 2.4681, 19: 9.8224, 20: 60.7893, "superflip": 3.8795},
-}
-
-h48_all_solutions = {
-	6: {17: 18.3998, 18: 223.3861, 19: 2459.7513},
-	7: {17: 9.6264, 18: 114.2752, 19: 1242.9741},
-	8: {17: 7.1793, 18: 81.2145, 19: 783.8113},
-	9: {17: 3.3487, 18: 36.9225, 19: 363.1912},
-	10: {17: 2.0002, 18: 20.4565, 19: 187.5199, 20: 1570.6074, "superflip": 68.2016},
-	11: {17: 1.2833, 18: 12.3616, 19: 105.6184, 20: 872.5879, "superflip": 34.1071},
-}
-
diff --git a/benchmarks/tables_16_threads.md b/benchmarks/tables_16_threads.md
@@ -0,0 +1,27 @@
+<!-- The following details block can be found in benchmarks/tables_16_threads.md -->
+<details><summary>Results: Single solution, 16 threads</summary>
+
+Time per cube (in seconds, lower is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|H48 h11   |56.5 GiB|    0.02|    0.06|    0.22|    1.33|     1.84|
+|H48 h10   |28.3 GiB|    0.03|    0.08|    0.33|    2.34|     4.18|
+|H48 h9    |14.1 GiB|    0.04|    0.15|    0.64|    4.45|     8.09|
+|H48 h8    | 7.1 GiB|    0.06|    0.34|    1.36|        |         |
+|H48 h7    | 3.5 GiB|    0.07|    0.47|    2.20|        |         |
+|H48 h6    | 1.8 GiB|    0.13|    0.91|    4.39|        |         |
+
+Speed-up factor (higher is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|H48 h11   |56.5 GiB|    3.90|    8.14|   10.39|   11.86|     8.45|
+|H48 h10   |28.3 GiB|    5.72|    9.00|   10.03|   10.04|     9.09|
+|H48 h9    |14.1 GiB|    6.56|    9.75|   10.53|        |         |
+|H48 h8    | 7.1 GiB|    8.08|    9.91|   10.40|        |         |
+|H48 h7    | 3.5 GiB|    8.59|   10.30|   10.57|        |         |
+|H48 h6    | 1.8 GiB|    9.60|   10.39|   10.54|        |         |
+<img src="img/16threads.png">
+
+</details>
diff --git a/benchmarks/tables_1_thread.md b/benchmarks/tables_1_thread.md
@@ -0,0 +1,40 @@
+<!-- The following details block can be found in benchmarks/tables_1_thread.md -->
+<details><summary>Results: Single solution, single thread</summary>
+
+Time per cube (in seconds, lower is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|vcube 212 |58.2 GiB|    0.11|    0.75|    3.43|   27.28|    19.30|
+|H48 h11   |56.5 GiB|    0.09|    0.50|    2.24|   15.73|    15.55|
+|vcube 404 |31.8 GiB|    0.23|    1.24|    6.10|   59.33|   268.26|
+|H48 h10   |28.3 GiB|    0.15|    0.76|    3.36|   23.51|    38.05|
+|vcube 308 |21.2 GiB|    0.17|    1.02|    6.20|   58.70|   604.35|
+|H48 h9    |14.1 GiB|    0.24|    1.48|    6.69|        |         |
+|vcube 208 | 7.3 GiB|    0.56|    4.36|   20.58|        |         |
+|H48 h8    | 7.1 GiB|    0.46|    3.36|   14.13|        |         |
+|H48 h7    | 3.5 GiB|    0.63|    4.85|   23.25|        |         |
+|vcube 112 | 2.4 GiB|    0.96|    9.29|   40.52|        |         |
+|H48 h6    | 1.8 GiB|    1.25|    9.45|   46.31|        |         |
+
+Time per cube adjusted for table size (in seconds \* GiB, lower is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|vcube 212 |58.2 GiB|    6.43|   43.80|  199.38| 1587.43|  1122.73|
+|H48 h11   |56.5 GiB|    5.21|   28.28|  126.68|  889.01|   878.77|
+|vcube 404 |31.8 GiB|    7.40|   39.47|  194.01| 1887.94|  8535.87|
+|H48 h10   |28.3 GiB|    4.22|   21.37|   94.86|  664.22|  1075.08|
+|vcube 308 |21.2 GiB|    3.51|   21.71|  131.50| 1245.26| 12819.94|
+|H48 h9    |14.1 GiB|    3.34|   20.97|   94.57|        |         |
+|vcube 208 | 7.3 GiB|    4.08|   31.74|  149.68|        |         |
+|H48 h8    | 7.1 GiB|    3.25|   23.75|   99.82|        |         |
+|H48 h7    | 3.5 GiB|    2.22|   17.12|   82.14|        |         |
+|vcube 112 | 2.4 GiB|    2.33|   22.53|   98.23|        |         |
+|H48 h6    | 1.8 GiB|    2.21|   16.70|   81.86|        |         |
+
+<img src="img/17moves1thread.png">
+<img src="img/18moves1thread.png">
+<img src="img/19moves1thread.png">
+<img src="img/20moves1thread.png">
+</details>
diff --git a/benchmarks/tables_4_threads.md b/benchmarks/tables_4_threads.md
@@ -0,0 +1,27 @@
+<!-- The following details block can be found in benchmarks/tables_4_threads.md -->
+<details><summary>Results: Single solution, 4 threads</summary>
+
+Time per cube (in seconds, lower is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|H48 h11   |56.5 GiB|    0.04|    0.14|    0.60|    3.97|     4.19|
+|H48 h10   |28.3 GiB|    0.05|    0.20|    0.91|    6.65|     9.81|
+|H48 h9    |14.1 GiB|    0.07|    0.39|    1.74|   12.54|    18.96|
+|H48 h8    | 7.1 GiB|    0.13|    0.90|    3.71|        |         |
+|H48 h7    | 3.5 GiB|    0.17|    1.28|    6.12|        |         |
+|H48 h6    | 1.8 GiB|    0.33|    2.47|   12.22|        |         |
+
+Speed-up factor (higher is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|H48 h11   |56.5 GiB|    2.55|    3.46|    3.75|    3.96|     3.71|
+|H48 h10   |28.3 GiB|    3.09|    3.71|    3.67|    3.54|     3.88|
+|H48 h9    |14.1 GiB|    3.25|    3.84|    3.85|        |         |
+|H48 h8    | 7.1 GiB|    3.53|    3.72|    3.81|        |         |
+|H48 h7    | 3.5 GiB|    3.60|    3.78|    3.80|        |         |
+|H48 h6    | 1.8 GiB|    3.77|    3.82|    3.79|        |         |
+<img src="img/4threads.png">
+
+</details>
diff --git a/benchmarks/tables_all_solutions.md b/benchmarks/tables_all_solutions.md
@@ -0,0 +1,26 @@
+<!-- The following details block can be found in benchmarks/tables_all_solutions.md -->
+<details><summary>Results: All solutions, 16 threads</summary>
+
+Time per cube (in seconds, lower is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|H48 h11   |56.5 GiB|    0.04|    0.26|    2.22|   18.96|    16.95|
+|H48 h10   |28.3 GiB|    0.05|    0.42|    3.82|   34.42|    36.80|
+|H48 h9    |14.1 GiB|    0.08|    0.73|    7.28|        |         |
+|H48 h8    | 7.1 GiB|    0.15|    1.56|   15.41|        |         |
+|H48 h7    | 3.5 GiB|    0.21|    2.38|   26.52|        |         |
+|H48 h6    | 1.8 GiB|    0.39|    4.67|   53.00|        |         |
+
+Time per cube adjusted for table size (in seconds \* GiB, lower is better).
+
+|  Solver  |  Size  |17 moves|18 moves|19 moves|20 moves|Superflip|
+|:---------|:-------|-------:|-------:|-------:|-------:|--------:|
+|H48 h11   |56.5 GiB|    2.11|   14.55|  125.21| 1071.23|   957.47|
+|H48 h10   |28.3 GiB|    1.48|   11.96|  107.83|  972.35|  1039.73|
+|H48 h9    |14.1 GiB|    1.08|   10.29|  102.88|        |         |
+|H48 h8    | 7.1 GiB|    1.03|   10.99|  108.87|        |         |
+|H48 h7    | 3.5 GiB|    0.74|    8.41|   93.69|        |         |
+|H48 h6    | 1.8 GiB|    0.70|    8.25|   93.68|        |         |
+
+</details>

	nissy-core The "engine" of nissy, including the H48 optimal solver.
	git clone https://git.tronto.net/nissy-core
	Download \| Log \| Files \| Refs \| README \| LICENSE

M	benchmarks/benchmarks.md	\|	159	++++++++++++++++++++++++++++++++++++-------------------------------------------
A	benchmarks/img/16threads.png	\|	0
D	benchmarks/img/17moves16threads.png	\|	0
M	benchmarks/img/17moves1thread.png	\|	0
D	benchmarks/img/17moves4threads.png	\|	0
D	benchmarks/img/18moves16threads.png	\|	0
M	benchmarks/img/18moves1thread.png	\|	0
D	benchmarks/img/18moves4threads.png	\|	0
D	benchmarks/img/19moves16threads.png	\|	0
M	benchmarks/img/19moves1thread.png	\|	0
D	benchmarks/img/19moves4threads.png	\|	0
D	benchmarks/img/20moves16threads.png	\|	0
M	benchmarks/img/20moves1thread.png	\|	0
D	benchmarks/img/20moves4threads.png	\|	0
A	benchmarks/img/4threads.png	\|	0
M	benchmarks/plot-benchmarks.py	\|	226	+++++++++++++++++++++++++++++++++++++++++++------------------------------------
D	benchmarks/results_h48_clang.py	\|	36	------------------------------------
A	benchmarks/tables_16_threads.md	\|	27	+++++++++++++++++++++++++++
A	benchmarks/tables_1_thread.md	\|	40	++++++++++++++++++++++++++++++++++++++++
A	benchmarks/tables_4_threads.md	\|	27	+++++++++++++++++++++++++++
A	benchmarks/tables_all_solutions.md	\|	26	++++++++++++++++++++++++++