regex.md (8627B)
1 # UNIX text filters, part 0 of 3: regular expressions 2 3 *This post is part of a [series](../../series)* 4 5 One of the most important features of UNIX and its descendants, if 6 not *the* most important feature, is input / output redirection: 7 the output of a command can be displayed to the user, written to a 8 file or used as the input for another command seamlessly, without 9 the the program knowing which of these things is happening. This 10 is possible because most UNIX programs use *plain text* as their 11 input/output language, which is understood equally well by the three 12 types of users - humans, files and other running programs. 13 14 Since this is such a fundamental feature of UNIX, I thought it would 15 be nice to go through some of the standard tools that help the user 16 take advantage of it. At first I thought of doing this as part of 17 my *man page reading club* series, but in the end I decided to give 18 them their own space. My other series has also been going on for 19 more than a year now, so it is a good time to end it and start a 20 new one. 21 22 Let me then introduce you to: **UNIX text filters**. 23 24 ## Text filters 25 26 For the purpose of this blog series, a *text filter* is a program 27 that reads plain text from standard input and writes a modified, 28 or *filtered* version of the same text to standard output. According 29 to the introductory paragraph, this definition includes most UNIX 30 programs; but we are going to focus on the following three, in 31 increasing order of complexity: 32 33 * grep 34 * sed 35 * awk 36 37 In order to unleash the true power of these tools, we first need 38 to grasp the basics of 39 [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). 40 And what better way to do it than following the dedicated 41 [OpenBSD manual page](https://man.openbsd.org/OpenBSD-7.3/re_format)? 42 43 ## (Extended) regular expressions 44 45 Regular expressions, or regexes for short, are a convenient way to 46 describe text patterns. They are commonly used to solve genering 47 string matching problems, such as determining if a given piece 48 of text is a valid URL. Many standard UNIX tools, including the three 49 we are going to cover in this series, support regexes. 50 51 Let's deal with the nasty part first: even within POSIX, there is 52 not one single standard for regular expressions; there are at least 53 two of them: Basic Regular Expressions (BREs) and Extended Regular 54 Expressions (ERE). As it always happens when there is more than one 55 standard for the same thing, other people decided to come up with 56 another version to replace all previous "standards", so we have also 57 [PCREs](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions), 58 and probably more. [Things got out of hand quickly](https://xkcd.com/927). 59 60 In this post I am going to follow the structure of 61 [re_format(7)](https://man.openbsd.org/OpenBSD-7.3/re_format) and 62 present *extended* regular expresssions first. After that I'll point 63 out the differences with *basic* regular expressions. 64 65 The goal is not to provide a complete guide to regexes, but rather 66 an introduction to the most important features, glossing over the 67 nasty edge-cases. Keep also in mind that I am in no way an expert 68 on the subject: we are learning together, here! 69 70 ### The basics 71 72 You can think of a regular expression as a *pattern*, or a *rule*, 73 that describes which strings are "valid" (they are *matched* by the 74 regular expression) and which are not. As a trivial example, the 75 regular expression `hello` matches only the string "hello". A less 76 trivial example is the regex `.*` that matches *any* string. I'll 77 explain why in a second. 78 79 Beware not to confuse regular expressions with *shell globs*, i.e. 80 the rules for shell command expansion. Although they use similar 81 symbols to achieve a similar goal, they are not the same thing. See 82 [my post on sh(1)](../2022-09-13-sh-1) or 83 [glob(7)](https://man.openbsd.org/OpenBSD-7.3/glob.7) for an 84 explanation on shell globs. 85 86 ### General structure and terminology 87 88 A general regex looks something like this: 89 90 ``` 91 piece piece piece ... | piece piece piece ... | ... 92 ``` 93 94 A sequence of *pieces* is called a *branch*, and a regex is a 95 sequence of branches separated by pipes `|`. Pieces are not separated 96 by spaces, they are simply concatenated. 97 98 The pipes `|` are read "or": a regex matches a given string if any 99 of its branches does. A branch matches a given string if the latter 100 can be written as a sequence of strings, each matching one of the 101 pieces, in the given order. 102 103 Before going into what pieces are exactly, consider the following 104 example: 105 106 ``` 107 hello|world 108 ``` 109 110 This regex matches both the string "hello" and the string "world", 111 and nothing else. The pieces are the single letters composing the 112 two words, and as you can see they are juxtaposed without spaces. 113 114 But what else is a valid piece? In general, a piece is made up of 115 an *atom*, optionally followed by a *multiplier*. 116 117 ### Atoms 118 119 As we have already seen, the most simple kind of atom is a single 120 character. The most *general* kind of atom, on the other hand, is 121 a whole regular expression enclosed in parentheses `()`. Yes, regexes 122 are recursive. 123 124 There are some special characters: for example, a single dot `.` 125 matches *any* single character. The characters `^` and `$` match 126 an empty string at the beginning and at the end of a line, respectively. 127 If you want to match a special character as if it was regular, say 128 because you want to match strings that represent values in the 129 dollar currency, you can *escape* them with a backslash. For example 130 `\$` matches the string "$". 131 132 The last kind of atoms are *bracket expressions*, which consist of 133 lists of characters enclosed in brackets `[]`. A simple list of 134 characters in brackets, like `[xyz]`, matches any character in the 135 list, unless the first character is a `^`, in which case it matches 136 every character *not* in the list. Two characters separated by a 137 dash `-` denote a range: for example `[a-z]` matches every lowercase 138 letter and `[1-7]` matches all digits from 1 to 7. 139 140 You can also use cetain special sets of characters, like `[:lower:]` 141 to match every lowercase letter (same as `[a-z]`), `[:alnum:]` to 142 match every alphanumeric character or `[:digit:]` to match every 143 decimal digit. Check the 144 [man page](https://man.openbsd.org/OpenBSD-7.3/re_format) 145 for the full list. 146 147 ### Multipliers 148 149 The term "multiplier" does not appear anywhere in the manual page, I 150 made it up. But I think it fits, so I'll keep using it. 151 152 Multipliers allow you to match an atom repeating a specified or 153 unspecified amount of times. The most general one is the *bound* 154 multiplier, which consists of one or two comma-separated numbers 155 enclosed in braces `{}`. 156 157 In its most simple form, the multiplier `{n}` repeats the multiplied 158 atom `n` times. For example, the regex `a{7}` is equivalent to the 159 regex `aaaaaaa` (and it matches the string "aaaaaaa"). 160 161 The form `{n,m}` matches *any number* between `n` and `m` of copies 162 of the preceeding atom. For example `a{2,4}` is equivalent to 163 `aa|aaa|aaaa`. If the integer `m` is not specified, the multiplied 164 atom matches any string that consists of *at least* `n` copies of 165 the atom. 166 167 Now we can explain very quickly the more common multipliers `+`, 168 `*` and `?`: they are equivalent to `{1,}`, `{0,}` and `{0,1}` 169 respectively. That is to say, `+` matches at least one copy of the 170 atom, `*` matches any number of copies (including none) and `?` 171 matches either one copy or none. 172 173 ## Basic regular expressions 174 175 Basic regular expressions are less powerful than their extended 176 counterpart (with one exception, see below) and require more 177 backslashes, but it is worth knowing them, because they are used 178 by default in some programs (for example [ed(1)](../2022-12-24-ed)). 179 The main differences between EREs and BREs are: 180 181 * BREs consist of one single branch, i.e. there is no `|`. 182 * Multipliers `+` and `?` do not exist. 183 * You need to escape parentheses `\(\)` and braces `\{\}` to 184 use them with their special meaning. 185 186 There is one feature of BREs, called *back-reference*, that is 187 absent in EREs. Apparently it makes the implementation much more 188 complex, and it makes BREs more powerful. I noticed the author of 189 the manual page despises back-references, so I am not going to learn 190 them out of respect for them. 191 192 ## Conclusion 193 194 Regexes are a powerful tool, and they are more than worth knowing. 195 But, quoting from the manual page: 196 197 ``` 198 Having two kinds of REs is a botch. 199 ``` 200 201 I hope you enjoyed this post, despite the lack of practical examples. 202 If you want to see more applications of regular expressions, stay 203 tuned for the next entries on grep, sed and awk! 204 205 *Next in the series: [grep](../2023-08-20-grep)*