# Sometimes it's the hardware

In the last few weeks, I was debugging a nasty bug on a hobby project.
I needed to work with ~58GB of data generated by the program itself,
which made everything quite slow and cumbersome to work with. But this
specific piece of code was quite simple and I was able to rule out
every cause I could think of, so where could the problem be?

(OK, I think you can guess it from the title, but I want to tell you the
whole story anyway.)

## The bug

Long story short, my program is a
[Rubik's cube solver](https://nissy.tronto.net/) and I need
to generate some lookup tables to speed up the solution search.
Pretty much a standard
[IDA*](https://en.wikipedia.org/wiki/Iterative_deepening_A*) method:
the larger the table, the faster the solver.

I am implementing a new kind of table, and I know how to reliably
generate one of around 30MB and one of around 58GB. I can also compute
tables of different sizes between those two, but the method I use is
experimental and I am not 100% sure the results are correct. But I can
also generate the intermediate tables by deriving them from the 58GB one.
This method is quite slow and only works if the user has enough RAM,
but it gives me a way to test the correctness of the other method.

The algorithm to derive the smaller tables from the huge one is quite
simple, but to my surprise I was getting different results at every
run. I thought it must be some kind of
[undefined behavior](https://en.wikipedia.org/wiki/Undefined_behavior),
as is often the case with C.

Time to take out my debugging weapons!

## The whole arsenal

To have a reasonable chance at uncovering nasty bugs related to undefined
behavior, bad memory access, concurrency and other C programming
nightmares, I used all of the following:

* GCC's compile-time warnings, enabled with `-Wall -Wextra -pedantic`.
* [Sanitizers](https://github.com/google/sanitizers), in particular the
  *address* and *undefined* sanitizers. The *thread* sanitizer could not be
  used with the large 58GB table because it requires much more memory, but in
  any case this specific computation is not parallelized (yet).
* `printf()` debugging. Not always effective when dealing with undefined
  behavior, but it never hurts.
* [GDB](https://en.wikipedia.org/wiki/GNU_Debugger).
* [Valgrind](https://valgrind.org/).

But nothing: my code did not trigger any errors with these tools.
Of course it could be an error in my logic, but why would I get
different results every time? This just did not make sense.

## Sometimes you are not stupid, sometimes the computer is broken

I started thinking that the OS could be doing something wrong. In fact,
I noticed that KDE and Firefox occasionally crashed when I used ~60GB of
RAM. Maybe the kernel messed up [swapping](https://wiki.debian.org/Swap)?
And a KDE or Firefox bug related to unusual memory sizes and usage would
not be surprising.

So I ran `swapoff -a` to disable swap, and then ran the
program again. And I got inconsistent results and crashes once again.

At this point I started doubting my hardware, so I ran
[memtest86](https://www.memtest86.com/). Sure enough, something was wrong:

![memtest FAIL message](fail.jpg)

By the way, isn't this screen beautiful? It looks like it comes straight
out of a 1980s hacker movie.

## Conclusion

With some more testing I was able to determine that one of the two RAM
sticks works fine, and the other one is definitely broken. In case you
are curious, memtest86 gives a similar screen with a green PASS message
when all tests pass.

Everything is still covered by warranty, so I asked for a replacement.
Hopefully I'll have a working PC with 64GB of RAM again soon, but in
the meantime I think I'll survive with 32GB.

I have been programming for almost 20 years, and this is the first time
an error in my program was due to a faulty piece of hardware rather than
a bug in my code. Cool :)