sebastiano.tronto.net

Source files and build scripts for my personal website
git clone https://git.tronto.net/sebastiano.tronto.net

fail.md (3945B)


      1 # Sometimes it's the hardware
      2 
      3 In the last few weeks, I was debugging a nasty bug on a hobby project.
      4 I needed to work with ~58GB of data generated by the program itself,
      5 which made everything quite slow and cumbersome to work with. But this
      6 specific piece of code was quite simple and I was able to rule out
      7 every possible cause of the bug, so where could the problem be?
      8 
      9 (Ok I think you can guess it from the title, but I want to tell you the
     10 whole story anyway.)
     11 
     12 ## The bug
     13 
     14 Long story short, my program is a
     15 [Rubik's cube solver](https://nissy.tronto.net/) and I need
     16 to generate some lookup tables to speed up the solution search.
     17 Pretty much a standard
     18 [IDA*](https://en.wikipedia.org/wiki/Iterative_deepening_A*) method:
     19 the larger the table, the faster the solver.
     20 
     21 I am implementing a new kind of table, and I know how to reliably
     22 generate one of around 30MB and one of around 58GB. I can also compute
     23 tables of different sizes between those two, but the method I use is
     24 experimental and I am not 100% sure the results are correct.  But I can
     25 also generate the intermediate tables by deriving them from the 58GB one.
     26 This method is quite slow and can only work if the user has enough RAM,
     27 but it gives me a way to test the correctness of the other method.
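
The cross-check described above comes down to comparing two tables of the same size, one derived from the 58GB table and one computed directly. A minimal sketch in C, assuming both tables are already in memory as plain byte arrays (the function name and the layout are hypothetical, not taken from the solver's code):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first byte where the two tables differ,
 * or n if the first n bytes are identical.
 * Hypothetical helper: name and byte-array layout are assumptions.
 */
size_t first_mismatch(const uint8_t *derived, const uint8_t *computed, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (derived[i] != computed[i])
			return i;
	return n;
}
```

Returning the index rather than a yes/no answer makes it easier to see whether mismatches cluster in one region of the table or are scattered around.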
     28 
     29 The algorithm to derive the smaller tables from the huge one is quite
     30 simple, but to my surprise I was getting different results at every
     31 run. I thought it must be some kind of
     32 [undefined behavior](https://en.wikipedia.org/wiki/Undefined_behavior),
     33 as is often the case with C.
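
One cheap way to confirm that results really change from run to run is to print a checksum of the derived table at the end of each run and compare the values. A minimal sketch, using the 64-bit FNV-1a hash as a stand-in (this is an illustration, not necessarily how the solver checks its tables; the function name is made up):

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash of n bytes. Not cryptographic, but fine for
 * spotting accidental differences between two runs. */
uint64_t fnv1a64(const uint8_t *data, size_t n)
{
	uint64_t h = UINT64_C(0xcbf29ce484222325); /* FNV offset basis */

	for (size_t i = 0; i < n; i++) {
		h ^= data[i];
		h *= UINT64_C(0x100000001b3); /* FNV prime */
	}
	return h;
}
```

If two runs on the same input print different checksums, the computation is not deterministic, whatever the cause turns out to be.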
     34 
     35 Time to take out my debugging weapons!
     36 
     37 ## The whole arsenal
     38 
     39 To have a reasonable chance at uncovering nasty bugs related to undefined
     40 behavior, bad memory access, concurrency and other C programming
     41 nightmares, I used all of the following:
     42 
     43 * The compile-time warnings GCC enables with `-Wall -Wextra -pedantic`.
     44 * [Sanitizers](https://github.com/google/sanitizers), in particular the
     45   *address* and *undefined* sanitizers. The *thread* sanitizer could not be
     46   used with the large 58GB table because it requires much more memory, but in
     47   any case this specific computation is not parallelized (yet).
     48 * `printf()` debugging. Not always effective when dealing with undefined
     49   behavior, but it never hurts.
     50 * [GDB](https://en.wikipedia.org/wiki/GNU_Debugger).
     51 * [Valgrind](https://valgrind.org/).
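
As a rough illustration of how the first two items combine, the comment at the top of this sketch shows one possible GCC command line. The flags are real GCC options, but the file names are placeholders and the actual build scripts of the project may look different. The function below it is a hypothetical bounds-checked table read, the kind of access the address sanitizer would report if the check were missing:

```c
/*
 * Possible build line (file names are placeholders):
 *   gcc -std=c99 -g -O1 -Wall -Wextra -pedantic \
 *       -fsanitize=address,undefined -fno-omit-frame-pointer \
 *       -o check check.c
 */
#include <stddef.h>
#include <stdint.h>

/* Read entry i of a table with n entries; return fallback if i is
 * out of range. Hypothetical helper, not part of the solver. */
uint8_t table_get(const uint8_t *table, size_t n, size_t i, uint8_t fallback)
{
	if (i >= n)
		return fallback; /* an unchecked table[i] is what ASan would flag */
	return table[i];
}
```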
     52 
     53 But nothing: my code did not trigger any error with these tools.
     54 Of course it could be an error in my logic, but why would I get
     55 different results every time? This just did not make sense.
     56 
     57 ## Sometimes you are not stupid, sometimes the computer is broken
     58 
     59 I started thinking that the OS could be doing something wrong. In fact,
     60 I noticed that KDE and Firefox occasionally crashed when I used ~60GB of
     61 RAM. Maybe the kernel messed up [swapping](https://wiki.debian.org/Swap)?
     62 And a KDE or Firefox bug related to unusual memory sizes and usage would
     63 not be surprising.
     64 
     65 So I ran `swapoff -a` to disable swap, and then ran the
     66 program again. And I got inconsistent results and crashes once again.
     67 
     68 At this point I started doubting my hardware, so I ran
     69 [memtest86](https://www.memtest86.com/). Sure enough, something was wrong:
     70 
     71 ![memtest FAIL message](fail.jpg)
     72 
     73 By the way, isn't this screen beautiful? It looks like it comes straight
     74 out of a 1980s hacker movie.
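
For the curious, the basic idea behind a memory test can be sketched in a few lines of C: write a known pattern to a buffer, read it back, and count the words that changed. This is only an illustration of the principle, not what memtest86 actually does. A userspace program cannot choose which physical memory it gets, and CPU caches can hide errors, so a sketch like this is no substitute for a dedicated tool. The function name and the choice of pattern are made up:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill buf with a position-dependent pattern, read it back, and
 * return the number of words that do not match.
 * volatile keeps the compiler from optimizing the read-back away.
 */
size_t pattern_test(volatile uint64_t *buf, size_t nwords, uint64_t seed)
{
	size_t errors = 0;

	for (size_t i = 0; i < nwords; i++)
		buf[i] = seed ^ ((uint64_t)i * UINT64_C(0x9e3779b97f4a7c15));
	for (size_t i = 0; i < nwords; i++)
		if (buf[i] != (seed ^ ((uint64_t)i * UINT64_C(0x9e3779b97f4a7c15))))
			errors++;
	return errors;
}
```

On healthy memory this returns 0 for any seed. Running it over a large allocation with a few different seeds may catch some faults, but a result of 0 does not prove the RAM is fine.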
     75 
     76 ## Conclusion
     77 
     78 With some more testing I was able to determine that one of the two RAM
     79 sticks works fine, and the other one is definitely broken. In case you
     80 are curious, memtest86 gives a similar screen with a green PASS message
     81 when all tests pass.
     82 
     83 Everything is still covered by warranty, so I asked for a replacement.
     84 Hopefully I'll have a working PC with 64GB of RAM again soon, but in
     85 the meantime I think I'll survive with 32GB.
     86 
     87 I have been programming for almost 20 years, and this is the first time
     88 an error in my program was due to a faulty piece of hardware rather than
     89 a bug in my code. Cool :)