# I had to debug C code on a smartphone

A few days ago someone contacted me about an issue they had with
[nissy](https://nissy.tronto.net) - a project of mine that I have talked
about in [my last blog post](../2023-04-10-the-big-rewrite).

I was happy to look into it, but I could not reproduce the error in any
way, while this person ran into it consistently every time they tried
to use a certain functionality.

They were using a [Mac M1](https://en.wikipedia.org/wiki/Apple_M1),
which has an
[ARM-based CPU](https://en.wikipedia.org/wiki/ARM_architecture_family).
So I guessed the error was caused by me relying on some undefined
behavior of C that resulted in different compiled code on
[x86](https://en.wikipedia.org/wiki/X86)
and on ARM. But I had no ARM-based machine to debug this.

Except...

## Everyone has an ARM computer

Most (if not all) smartphones have an ARM-based CPU. This means, at
least in theory, that if this bug was really related to this different
CPU architecture, I could reproduce it on my phone.

Nissy is a command line application. To compile it you just need a C
compiler + standard library and a terminal emulator. On Android there is
[termux](https://termux.dev), which I already use to ssh into my personal
server in case I need to check something on the go and to play around.
So I installed git, [clang](https://clang.llvm.org) and gdb on it, and
I was ready to go!

![A screenshot of my phone running termux, debugging nissy](termux.jpg)

This was not the most pleasant experience. Yes, I could have installed
vim or some other text editor instead of using ed, but I don't think this
would have improved things all that much. I mostly edited the code on
my laptop and transferred my changes to my phone with quick git pushes
and pulls, keeping text editing on the phone to a minimum.

And it worked!
I was able to reproduce the bug on the first try in
this environment. In hindsight, I should have tried building nissy
with a different compiler first, which might have saved me the hassle
of working on a 5-inch screen. I tried afterwards, but I could not
reproduce the error this way.

## The actual bug

The bug itself was just a classic out-of-bounds error. Simplifying a
bit, at the beginning of a file I had a bunch of static arrays that looked
pretty much like this:

```
#define N 10000
static int a[N];
static int b[N];
```

The values in these arrays were written only once, in their respective
`initialize_a()` and `initialize_b()` functions, both called at startup.

The second array `b[]` was initialized correctly, but the value `b[0]`
changed after calling `initialize_a()`, which in theory did not touch
`b[]` in any way. But, due to some wrong logic, in this function I ended
up writing some value into `a[N]`, which is out of the bounds of array
`a[]`. Apparently, when targeting ARM the compiler decided to allocate
the space for `a[]` and `b[]` in contiguous areas of memory, something
that did not happen on other architectures - perhaps some padding was
added between the two?

Once spotted, fixing the bug was easy: if a certain index `i` reached
the value `N`, the correct thing to do was to skip that value. I had
simply forgotten to check this. Adding an `if (i != N)` solved it.

## Retrospective

Debugging on a smartphone is obviously not ideal, especially since nissy
(at least in its current form) is not meant to run on one. This motivated
me to think back and look for ways to prevent this kind of problem.

### Testing

The error in the code had nothing to do with CPU architectures: it
was a logic error. The algorithm I had in mind was correct, but I
forgot one case and typed it out wrong.
This is something that is
bound to happen to everyone, so how could I have avoided it?

A good way to spot errors in your logic is to write [unit
tests](https://en.wikipedia.org/wiki/Unit_testing). In this particular
case, though, I cannot think, off the top of my head, of a
unit test that would spot this error, at least when running on an x86
machine. In the end, the function `initialize_a()` achieved its goal -
albeit with an undesired side effect.

### Better tools

In C, the size of an array is just an indication of how much memory
has to be allocated for it. There is no runtime check when accessing an
element. Most compilers can check for *static* out-of-bounds accesses, i.e.
`int a[10]; a[11] = 0` will result in a warning (not even an error!)
at compile time. But even this would not have spotted my bug.

Tools like [Valgrind](https://valgrind.org) can help you analyze this
kind of memory-related issue, such as accessing unallocated memory
areas and memory leaks. However, to my surprise, valgrind did not help
here. I guess this is because the memory I ended up accessing was still
reserved for my code, just for a different array - or for some padding
between the two. Or perhaps I should have used more thorough settings.

There are modern languages that try to prevent you from shooting yourself
in the foot, like [Rust](https://www.rust-lang.org). But for me C has a
huge advantage over any of these better-on-paper alternatives: I know it
decently well. Another good reason is ubiquity - I don't want to force
my few potential users to install a whole Rust environment just for nissy!

**Update:** After sharing this post, I have been advised to use the
compiler option `-fsanitize=address`, which adds some runtime
checks to detect this kind of memory error. And it works!
Compiling the pre-bugfix version of the code with this extra option
and then launching nissy results in the following error:

```
src/coord.c:554:17: runtime error: index 70 out of bounds for type 'int [70]'
src/coord.c:554:36: runtime error: store to address 0x56383f5236b8 with insufficient space for an object of type 'int'
...
```

[Sanitizers](https://github.com/google/sanitizers) are a
relatively recent compiler feature, available in `clang`
by default and in `gcc` via the external `libsanitizer`
library. The earliest reference I could find is a talk from
2011 ([YouTube video](https://www.youtube.com/watch?v=CPnRS1nv3_s),
[slides](https://llvm.org/devmtg/2011-11/Serebryany_FindingRacesMemoryErrors.pdf)).
Coincidentally, I had just read about them in [a blog
post](https://nullprogram.com/blog/2023/04/29) a week ago, but I did
not think about using them. From now on, I definitely will!

### Real world checks

Running your software on more platforms and making sure everything
works as expected is a good way to spot errors that are architecture-
or compiler-dependent. I am definitely not going to buy a Mac M1 just to
test out this toy project, but I could at least test it on all the devices
I have - including my phone. Since it is a command-line application,
setting up a test suite that runs a bunch of commands and then checks
that the output is as expected would be relatively easy.

## Conclusion

Typing on a phone is painful. Nonetheless, debugging this was actually
kind of fun.

Knowing some low-level stuff always helps. In this case, I was able to
reproduce the issue only because I knew that different CPU architectures
exist, and that a Mac M1 is similar to an Android phone in this regard.

But I also want to stress that this bug was not related to the CPU
architecture: there was a logic error in my code.
The fact that it was
only visible on ARM is a coincidence. In the end, correct logic is the
most important thing in coding.