sebastiano.tronto.net

Source files and build scripts for my personal website
git clone https://git.tronto.net/sebastiano.tronto.net

debugging-smartphone.md (7440B)


      1 # I had to debug C code on a smartphone
      2 
      3 A few days ago someone contacted me about an issue they had with
      4 [nissy](https://nissy.tronto.net) - a project of mine that I have talked
      5 about in [my last blog post](../2023-04-10-the-big-rewrite).
      6 
      7 I was happy to look into it, but I could not reproduce the error in any
      8 way, while this person ran into it consistently every time they tried
      9 to use a certain functionality.
     10 
     11 They were using a [Mac M1](https://en.wikipedia.org/wiki/Apple_M1),
     12 which has an
     13 [ARM-based CPU](https://en.wikipedia.org/wiki/ARM_architecture_family).
     14 So I guessed the error was caused by me relying on some undefined
     15 behavior of C that resulted in different compiled code on
     16 [x86](https://en.wikipedia.org/wiki/X86)
     17 and on ARM. But I had no ARM-based machine to debug this.
     18 
     19 Except...
     20 
     21 ## Everyone has an ARM computer
     22 
     23 Most (if not all) smartphones have an ARM-based CPU. This means, at
     24 least in theory, that if this bug was really related to this different
     25 CPU architecture, I could reproduce it on my phone.
     26 
     27 Nissy is a command line application. To compile it you just need a C
     28 compiler + standard library and a terminal emulator. On Android there is
     29 [termux](https://termux.dev), which I already use to ssh into my personal
     30 server in case I need to check something on the go and to play around.
     31 So I installed git, [clang](https://clang.llvm.org) and gdb on it, and
     32 I was ready to go!
     33 
     34 ![A screenshot of my phone running termux, debugging nissy](termux.jpg)
     35 
     36 This was not the most pleasant experience. Yes, I could have installed
     37 vim or some other text editor instead of using ed, but I don't think this
     38 would have improved things all that much. I mostly edited the code on
     39 my laptop and transferred my changes to my phone with quick git push &
     40 pulls, keeping text editing on the phone to a minimum.
     41 
     42 And it worked! I was able to reproduce the bug on the first try in
     43 this environment.  In hindsight, I should have tried building nissy
     44 with a different compiler first, which would have saved me the hassle
     45 of working on a 5-inch screen.  I tried afterwards, but I could not
     46 reproduce the error this way.
     47 
     48 ## The actual bug
     49 
     50 The bug itself was just a classic out-of-bounds error.  Simplifying a
     51 bit, at the beginning of a file I had a bunch of static arrays that looked
     52 pretty much like this:
     53 
     54 ```
     55 #define N 10000
     56 static int a[N];
     57 static int b[N];
     58 ```
     59 
     60 The values in these arrays were written only once, in their respective
     61 `initialize_a()` and `initialize_b()` functions, both called at startup.
     62 
     63 The second array `b[]` was initialized correctly, but the value `b[0]`
     64 changed after calling `initialize_a()`, which in theory did not touch
     65 `b[]` in any way. But, due to some wrong logic, in this function I ended
     66 up writing some value into `a[N]`, which is out of the bounds of array
     67 `a[]`.  Apparently, when targeting ARM the compiler decided to allocate
     68 the space for `a[]` and `b[]` in contiguous areas of memory, something
     69 that did not happen on other architectures - perhaps some padding was
     70 added between the two?
     71 
     72 Once spotted, fixing the bug was easy: if a certain index `i` reached
     73 the value `N`, the correct thing to do was to skip that value. I had
     74 simply forgotten to check this. Adding an `if (i != N)` solved it.
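
A minimal sketch of the fix, not nissy's actual code: the helper
`next_index()` and the stored values are hypothetical stand-ins, and only
`N`, `a[]` and the `i != N` check come from the description above.

```c
#include <stddef.h>

#define N 10000

static int a[N];

/* Hypothetical stand-in for the real index computation: it can
 * return any value in 0..N, where N means "nothing to store". */
static size_t next_index(size_t k)
{
    return (k * 7919u) % (N + 1);
}

/* Fills a[] and returns how many values were skipped because
 * their index was N. */
size_t initialize_a(void)
{
    size_t skipped = 0;

    for (size_t k = 0; k < N; k++) {
        size_t i = next_index(k);

        if (i != N)          /* the check that was missing */
            a[i] = (int)k;
        else
            skipped++;
    }
    return skipped;
}
```

Without the `if`, the iteration where `i == N` would write one element
past the end of `a[]`, which is undefined behavior.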
     75 
     76 ## Retrospective
     77 
     78 Debugging on a smartphone is obviously not ideal, especially since nissy
     79 (at least in its current form) is not meant to run on one.  This motivated
     80 me to think back and look for ways to prevent this kind of problem.
     81 
     82 ### Testing
     83 
     84 The error in the code had nothing to do with CPU architectures: it
     85 was a logic error. The algorithm I had in mind was correct, but I
     86 forgot one case and typed it out wrong. This is something that is
     87 bound to happen to everyone, so how could I have avoided it?
     88 
     89 A good way to spot errors in your logic is to write [unit
     90 tests](https://en.wikipedia.org/wiki/Unit_testing).  In this particular
     91 case, though, I cannot think off the top of my head how to write a
     92 unit test that would spot this error, at least when running on an x86
     93 machine. In the end, the function `initialize_a()` achieved its goal -
     94 albeit with an undesired side effect.
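
One possible approach, sketched under a big assumption: that the
initializer can be refactored to take its destination buffer as a
parameter (the functions `init_buggy()` and `init_fixed()` below are
hypothetical, not nissy's). The test then hands it a buffer with one extra
guard slot, so a stray write to index `N` lands in memory the test owns
and can inspect, regardless of how the compiler lays out global arrays.

```c
#include <stdbool.h>
#include <stddef.h>

#define N 100               /* small size, for the sketch only */
#define CANARY 0x5AFEC0DE   /* arbitrary marker value */

/* Hypothetical initializers taking the destination as a parameter.
 * The buggy one also writes index N; the fixed one skips it. */
void init_buggy(int *dst)
{
    for (size_t i = 0; i <= N; i++)
        dst[i] = (int)i;
}

void init_fixed(int *dst)
{
    for (size_t i = 0; i <= N; i++)
        if (i != N)
            dst[i] = (int)i;
}

/* Returns true if init() left the guard slot after the first N
 * elements untouched. */
bool stays_in_bounds(void (*init)(int *))
{
    int buf[N + 1];         /* N real slots plus one guard slot */

    buf[N] = CANARY;
    init(buf);
    return buf[N] == CANARY;
}
```

A guard slot only catches writes that land exactly on it, so this is a
narrow check rather than a general out-of-bounds detector.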
     95 
     96 ### Better tools
     97 
     98 In C, the size of an array is just an indication of how much memory
     99 has to be allocated for it. There is no runtime check when accessing an
    100 element. Most compilers can check for *static* out-of-bounds accesses, i.e.
    101 `int a[10]; a[11] = 0;` will result in a warning (not even an error!)
    102 at compile time. But even this would not have spotted my bug.
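
Since the language does not check indices at runtime, the check has to be
written by hand wherever the index is not known at compile time. A minimal
sketch, where the helper `a_set()` is a hypothetical name introduced only
for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

#define N 10

static int a[N];

/* Hypothetical helper: stores v in a[i] only if i is a valid
 * index, and reports whether the store happened. */
bool a_set(size_t i, int v)
{
    if (i >= N)
        return false;   /* C itself would not have stopped this write */
    a[i] = v;
    return true;
}
```

The caller can then decide what an invalid index means: skip the value,
report an error, or abort.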
    103 
    104 Tools like [Valgrind](https://valgrind.org) can help you analyze these
    105 kinds of memory-related issues, such as accessing unallocated memory
    106 areas and memory leaks. However, to my surprise, valgrind did not help
    107 here. I guess this is because the memory I ended up accessing was still
    108 reserved for my code, just for a different array - or for some padding
    109 between the two.  Or perhaps I should have used more thorough settings.
    110 
    111 There are modern languages that try to prevent you from shooting yourself
    112 in the foot, like [Rust](https://www.rust-lang.org). But for me C has a
    113 huge advantage over any of these better-on-paper alternatives: I know it
    114 decently well. Another good reason is ubiquity - I don't want to force
    115 my few potential users to install a whole Rust environment just for nissy!
    116 
    117 **Update:** After sharing this post, I have been advised to use the
    118 compiler option `-fsanitize=address`, which adds some runtime
    119 checks to detect this kind of memory error. And it works!
    120 Compiling the pre-bugfix version of the code with this extra option
    121 and then launching nissy results in the following error:
    122 
    123 ```
    124 src/coord.c:554:17: runtime error: index 70 out of bounds for type 'int [70]'
    125 src/coord.c:554:36: runtime error: store to address 0x56383f5236b8 with insufficient space for an object of type 'int'
    126 ...
    127 ```
    128 
    129 [Sanitizers](https://github.com/google/sanitizers) are a
    130 relatively recent compiler feature, available in `clang`
    131 by default and in `gcc` via the external `libsanitizer`
    132 library. The earliest reference I could find is a talk from
    133 2011 ([YouTube video](https://www.youtube.com/watch?v=CPnRS1nv3_s),
    134 [slides](https://llvm.org/devmtg/2011-11/Serebryany_FindingRacesMemoryErrors.pdf)).
    135 Coincidentally, I had just read about them in [a blog
    136 post](https://nullprogram.com/blog/2023/04/29) a week ago, but I did
    137 not think about using them. From now on, I definitely will!
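
To try the sanitizers on a small case, something like the following
sketch can be used. The file name `demo.c`, the function names and the
flag combination are assumptions: `-fsanitize=address` is the option
mentioned above, and adding `undefined` also enables the
undefined-behavior sanitizer, whose array bounds check prints messages of
the form `index ... out of bounds for type ...`.

```c
/*
 * Build with sanitizers enabled (clang or gcc), for example:
 *   cc -g -fsanitize=address,undefined demo.c -o demo
 */
#include <stddef.h>

#define N 70

static int a[N];
static int b[N];

/* No bounds check, on purpose: store(N, v) reproduces the bug. */
void store(size_t i, int v)
{
    a[i] = v;
}

/* Lets a caller observe whether b[] was clobbered. */
int first_b(void)
{
    return b[0];
}
```

Calling `store(N, 1)` from a `main()` in an instrumented build should
produce a runtime report pointing at the offending line, while a plain
build may silently overwrite a neighboring object or appear to work.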
    138 
    139 ### Real world checks
    140 
    141 Running your software on more platforms and making sure everything
    142 works as expected is a good way to spot errors that are architecture-
    143 or compiler-dependent. I am definitely not going to buy a Mac M1 just to
    144 test out this toy project, but I could at least test it on all the devices
    145 I have - including my phone.  Since it is a command-line application,
    146 setting up a test suite that runs a bunch of commands and then checks
    147 that the output is as expected would be relatively easy.
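
A shell script is probably the most natural tool for this, but to stay in
C, here is a rough sketch of the core check using POSIX `popen()`. The
helper `first_line_is()` is a hypothetical name, and it only looks at the
first line of output to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Runs cmd through the shell and compares the first line of its
 * standard output (without the trailing newline) to expected. */
bool first_line_is(const char *cmd, const char *expected)
{
    char line[256];
    bool ok = false;
    FILE *p = popen(cmd, "r");

    if (p == NULL)
        return false;
    if (fgets(line, sizeof line, p) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        ok = strcmp(line, expected) == 0;
    }
    pclose(p);
    return ok;
}
```

A test runner would call this once per command and expected result, and
exit with a non-zero status if any comparison fails.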
    148 
    149 ## Conclusion
    150 
    151 Typing on a phone is painful. Nonetheless, debugging this was actually
    152 kind of fun.
    153 
    154 Knowing some low-level stuff always helps. In this case, I was able to
    155 reproduce the issue only because I knew that different CPU architectures
    156 exist, and that a Mac M1 is similar to an Android phone in this regard.
    157 
    158 But I also want to stress that this bug was not related to the CPU
    159 architecture: there was a logic error in my code. The fact that it was
    160 only visible on ARM is a coincidence.  In the end, correct logic is the
    161 most important thing in coding.