sebastiano.tronto.net

Source files and build scripts for my personal website
git clone https://git.tronto.net/sebastiano.tronto.net
Download | Log | Files | Refs | README

commit c02deeb873cf241e398901cb3e1f5d4bf3a0f33c
parent 3a9093f641d608de0344e1de04f125e581296814
Author: Sebastiano Tronto <sebastiano@tronto.net>
Date:   Sun, 29 Sep 2024 11:12:47 +0200

blog post

Diffstat:
Asrc/blog/2024-09-29-fail/fail.jpg | 0
Asrc/blog/2024-09-29-fail/fail.md | 89+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/src/blog/2024-09-29-fail/fail.jpg b/src/blog/2024-09-29-fail/fail.jpg Binary files differ. diff --git a/src/blog/2024-09-29-fail/fail.md b/src/blog/2024-09-29-fail/fail.md @@ -0,0 +1,89 @@ +# Sometimes it's the hardware + +In the last few weeks, I was debugging a nasty bug on a hobby project. +I needed to work with ~58GB of data generated by the program itself, +which made everything quite slow and cumbersome to work with. But this +specific piece of code was quite simple and I was able to rule out +every possible cause of the bug, so where could the problem be? + +(Ok I think you can guess it from the title, but I want to tell you the +whole story anyway.) + +## The bug + +Long story short, my program is a +[Rubik's cube solver](https://nissy.tronto.net/) and I need +to generate some lookup tables to speed up the solution search. +Pretty much a standard +[IDA*](https://en.wikipedia.org/wiki/Iterative_deepening_A*) method: +the larger the table, the faster the solver. + +I am implementing a new kind of tables, and I know how to reliably +generate one of around 30MB and one of around 58GB. I can also compute +tables of different sizes between those two, but the method I use is +experimental and I am not 100% sure the results are correct. But I can +also generate the intermediate tables by deriving them from the 58GB one. +This method is quite slow and can only work if the user has enough RAM, +but it gives me a way to test the correctness of the other method. + +The algorithm to derive the smaller tables from the huge one is quite +simple, but to my surprise I was getting different results at every +run. I thought it must be some kind of +[undefined behavior](https://en.wikipedia.org/wiki/Undefined_behavior), +as it often the case with C. + +Time to take out my debugging weapons! + +## The whole arsenal + +To have a reasonable chance at uncovering nasty bugs related to undefined +behavior, bad memory access, concurrency and other C programming +nightmares, I used all of the following: + +* All compile-time warnings GCC offers, enabled with `-Wall -Wextra -pedantic`. +* [Sanitizers](https://github.com/google/sanitizers), in particular the + *address* and *undefined* sanitizers. The *thread* sanitizer could not be + used with the large 58GB table because it requires much more memory, but in + any case this specific computation is not parallelized (yet). +* `printf()` debugging. Not always effective when dealing with undefined + behavior, but it never hurts. +* [GDB](https://en.wikipedia.org/wiki/GNU_Debugger). +* [Valgrind](https://valgrind.org/). + +But nothing, my code did not trigger any error with these tools. +Of course it could be an error in my logic, but why would I get a +different results every time? This just did not make sense. + +## Sometimes you are not stupid, sometimes the computer is broken + +I started thinking that the OS could be doing something wrong. In fact, +I noticed that KDE and Firefox occasionally crashed when I used ~60GB of +RAM. Maybe the Kernel messed up [swapping](https://wiki.debian.org/Swap)? +And a KDE or Firefox bug related to unusual memory sizes and usage would +not be surprising. + +So I tried giving a `swapoff -a` to disable swap, and then run the +program again. And I got inconsistent results and crashes once again. + +At this point I started doubting my hardware, so I ran +[memtest86](https://www.memtest86.com/). Sure enough, something was wrong: + +![memtest FAIL message](fail.jpg) + +By the way, isn't this screen beautiful? It looks like it comes straight +out of an 1980's hacker movie. + +## Conclusion + +With some more testing I was able to determine that one of the two RAM +sticks works fine, and the other one is definitely broken. In case you +are curious, memtest86 gives a similar screen with a green PASS message +when all tests pass. + +Everything is still covered by warranty, so I asked for a replacement. +Hopefully I'll have a working PC with 64GB of RAM again soon, but in +the meantime I think I'll survive with 32GB. + +I have been programming for almost 20 years, and this is the first time +an error in my program was due to a faulty piece of hardware rather than +a bug in my code. Cool :)