regex.md - sebastiano.tronto.net - Source files and build scripts for my personal website

regex.md (8627B)
      1 # UNIX text filters, part 0 of 3: regular expressions
      2 
      3 *This post is part of a [series](../../series)*
      4 
      5 One of the most important features of UNIX and its descendants, if
      6 not *the* most important feature, is input / output redirection:
      7 the output of a command can be displayed to the user, written to a
      8 file or used as the input for another command seamlessly, without
      9 the the program knowing which of these things is happening. This
     10 is possible because most UNIX programs use *plain text* as their
     11 input/output language, which is understood equally well by the three
     12 types of users - humans, files and other running programs.
     13 
     14 Since this is such a fundamental feature of UNIX, I thought it would
     15 be nice to go through some of the standard tools that help the user
     16 take advantage of it. At first I thought of doing this as part of
     17 my *man page reading club* series, but in the end I decided to give
     18 them their own space. My other series has also been going on for
     19 more than a year now, so it is a good time to end it and start a
     20 new one.
     21 
     22 Let me then introduce you to: **UNIX text filters**.
     23 
     24 ## Text filters
     25 
     26 For the purpose of this blog series, a *text filter* is a program
     27 that reads plain text from standard input and writes a modified,
     28 or *filtered* version of the same text to standard output. According
     29 to the introductory paragraph, this definition includes most UNIX
     30 programs; but we are going to focus on the following three, in
     31 increasing order of complexity:
     32 
     33 * grep
     34 * sed
     35 * awk
     36 
     37 In order to unleash the true power of these tools, we first need
     38 to grasp the basics of
     39 [regular expressions](https://en.wikipedia.org/wiki/Regular_expression).
     40 And what better way to do it than following the dedicated
     41 [OpenBSD manual page](https://man.openbsd.org/OpenBSD-7.3/re_format)?
     42 
     43 ## (Extended) regular expressions
     44 
     45 Regular expressions, or regexes for short, are a convenient way to
     46 describe text patterns. They are commonly used to solve genering
     47 string matching problems, such as determining if a given piece
     48 of text is a valid URL. Many standard UNIX tools, including the three
     49 we are going to cover in this series, support regexes.
     50 
     51 Let's deal with the nasty part first: even within POSIX, there is
     52 not one single standard for regular expressions; there are at least
     53 two of them: Basic Regular Expressions (BREs) and Extended Regular
     54 Expressions (ERE). As it always happens when there is more than one
     55 standard for the same thing, other people decided to come up with
     56 another version to replace all previous "standards", so we have also
     57 [PCREs](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions),
     58 and probably more. [Things got out of hand quickly](https://xkcd.com/927).
     59 
     60 In this post I am going to follow the structure of
     61 [re_format(7)](https://man.openbsd.org/OpenBSD-7.3/re_format) and
     62 present *extended* regular expresssions first. After that I'll point
     63 out the differences with *basic* regular expressions.
     64 
     65 The goal is not to provide a complete guide to regexes, but rather
     66 an introduction to the most important features, glossing over the
     67 nasty edge-cases. Keep also in mind that I am in no way an expert
     68 on the subject: we are learning together, here!
     69 
     70 ### The basics
     71 
     72 You can think of a regular expression as a *pattern*, or a *rule*,
     73 that describes which strings are "valid" (they are *matched* by the
     74 regular expression) and which are not. As a trivial example, the
     75 regular expression `hello` matches only the string "hello". A less
     76 trivial example is the regex `.*` that matches *any* string.  I'll
     77 explain why in a second.
     78 
     79 Beware not to confuse regular expressions with *shell globs*, i.e.
     80 the rules for shell command expansion. Although they use similar
     81 symbols to achieve a similar goal, they are not the same thing. See
     82 [my post on sh(1)](../2022-09-13-sh-1) or
     83 [glob(7)](https://man.openbsd.org/OpenBSD-7.3/glob.7) for an
     84 explanation on shell globs.
     85 
     86 ### General structure and terminology
     87 
     88 A general regex looks something like this:
     89 
     90 ```
     91 piece piece piece ... | piece piece piece ... | ...
     92 ```
     93 
     94 A sequence of *pieces* is called a *branch*, and a regex is a
     95 sequence of branches separated by pipes `|`. Pieces are not separated
     96 by spaces, they are simply concatenated.
     97 
     98 The pipes `|` are read "or": a regex matches a given string if any
     99 of its branches does. A branch matches a given string if the latter
    100 can be written as a sequence of strings, each matching one of the
    101 pieces, in the given order.
    102 
    103 Before going into what pieces are exactly, consider the following
    104 example:
    105 
    106 ```
    107 hello|world
    108 ```
    109 
    110 This regex matches both the string "hello" and the string "world",
    111 and nothing else. The pieces are the single letters composing the
    112 two words, and as you can see they are juxtaposed without spaces.
    113 
    114 But what else is a valid piece? In general, a piece is made up of
    115 an *atom*, optionally followed by a *multiplier*.
    116 
    117 ### Atoms
    118 
    119 As we have already seen, the most simple kind of atom is a single
    120 character. The most *general* kind of atom, on the other hand, is
    121 a whole regular expression enclosed in parentheses `()`. Yes, regexes
    122 are recursive.
    123 
    124 There are some special characters: for example, a single dot `.`
    125 matches *any* single character. The characters `^` and `$` match
    126 an empty string at the beginning and at the end of a line, respectively.
    127 If you want to match a special character as if it was regular, say
    128 because you want to match strings that represent values in the
    129 dollar currency, you can *escape* them with a backslash. For example
    130 `\$` matches the string "$".
    131 
    132 The last kind of atoms are *bracket expressions*, which consist of
    133 lists of characters enclosed in brackets `[]`. A simple list of
    134 characters in brackets, like `[xyz]`, matches any character in the
    135 list, unless the first character is a `^`, in which case it matches
    136 every character *not* in the list. Two characters separated by a
    137 dash `-` denote a range: for example `[a-z]` matches every lowercase
    138 letter and `[1-7]` matches all digits from 1 to 7.
    139 
    140 You can also use cetain special sets of characters, like `[:lower:]`
    141 to match every lowercase letter (same as `[a-z]`), `[:alnum:]` to
    142 match every alphanumeric character or `[:digit:]` to match every
    143 decimal digit. Check the
    144 [man page](https://man.openbsd.org/OpenBSD-7.3/re_format)
    145 for the full list.
    146 
    147 ### Multipliers
    148 
    149 The term "multiplier" does not appear anywhere in the manual page, I
    150 made it up. But I think it fits, so I'll keep using it.
    151 
    152 Multipliers allow you to match an atom repeating a specified or
    153 unspecified amount of times. The most general one is the *bound*
    154 multiplier, which consists of one or two comma-separated numbers
    155 enclosed in braces `{}`.
    156 
    157 In its most simple form, the multiplier `{n}` repeats the multiplied
    158 atom `n` times. For example, the regex `a{7}` is equivalent to the
    159 regex `aaaaaaa` (and it matches the string "aaaaaaa").
    160 
    161 The form `{n,m}` matches *any number* between `n` and `m` of copies
    162 of the preceeding atom. For example `a{2,4}` is equivalent to
    163 `aa|aaa|aaaa`. If the integer `m` is not specified, the multiplied
    164 atom matches any string that consists of *at least* `n` copies of
    165 the atom.
    166 
    167 Now we can explain very quickly the more common multipliers `+`,
    168 `*` and `?`: they are equivalent to `{1,}`, `{0,}` and `{0,1}`
    169 respectively.  That is to say, `+` matches at least one copy of the
    170 atom, `*` matches any number of copies (including none) and `?`
    171 matches either one copy or none.
    172 
    173 ## Basic regular expressions
    174 
    175 Basic regular expressions are less powerful than their extended
    176 counterpart (with one exception, see below) and require more
    177 backslashes, but it is worth knowing them, because they are used
    178 by default in some programs (for example [ed(1)](../2022-12-24-ed)).
    179 The main differences between EREs and BREs are:
    180 
    181 * BREs consist of one single branch, i.e. there is no `|`.
    182 * Multipliers `+` and `?` do not exist.
    183 * You need to escape parentheses `\(\)` and braces `\{\}` to
    184   use them with their special meaning.
    185 
    186 There is one feature of BREs, called *back-reference*, that is
    187 absent in EREs. Apparently it makes the implementation much more
    188 complex, and it makes BREs more powerful. I noticed the author of
    189 the manual page despises back-references, so I am not going to learn
    190 them out of respect for them.
    191 
    192 ## Conclusion
    193 
    194 Regexes are a powerful tool, and they are more than worth knowing.
    195 But, quoting from the manual page:
    196 
    197 ```
    198      Having two kinds of REs is a botch.
    199 ```
    200 
    201 I hope you enjoyed this post, despite the lack of practical examples.
    202 If you want to see more applications of regular expressions, stay
    203 tuned for the next entries on grep, sed and awk!
    204 
    205 *Next in the series: [grep](../2023-08-20-grep)*
	sebastiano.tronto.net Source files and build scripts for my personal website
	git clone https://git.tronto.net/sebastiano.tronto.net
	Download \| Log \| Files \| Refs \| README