# UNIX text filters, part 2 of 3: sed

*This post is part of a [series](../../series)*

After the first (or second, depending on how you prefer to call ordinals
in a 0-based system) episode on [`grep`](../2023-08-20-grep) we are ready
to look at `sed`, the *stream* editor!

You can think of `sed` as the weird cousin of [`ed`](../2022-12-24-ed),
the standard editor, as they share much of their syntax. You could
argue that `ed` is the weirder one, though.

On the other hand, the *stream* part of `sed` is very peculiar,
and I prefer to think about it as a sort of `grep` that can not
only pick the desired lines, but also edit them. You can decide
which point of view you prefer after reading this post!

## Basic usage

The way sed works is easy to summarize: text is read from standard input
(or from a given file) line by line, a command is applied to each line,
and the output is printed. Pretty much the same as for `grep`, except
for the *a command is applied* part. Therefore, the power of `sed`
comes from the available commands.

A typical sed command is run like this:

```
$ sed [options] 'command' [file ...]
```

Instead of diving into the formal definition of the
grammar of sed, or following the
[manual page](https://man.openbsd.org/sed),
let's start with the basics.

### Replacing text: the `s` command

Most of the time I use `sed`, and pretty much every time I use it
in an interactive shell, I just use the *substitution command* `s`.
If you have used `sed` in the past, chances are you have used `s`.

As a basic example, say you want to replace all occurrences of the word
"dog" with the word "cat". Then you can use `sed s/dog/cat/g`:

```
$ echo "I love dogs! My dog is cute" | sed 's/dog/cat/g'
I love cats! My cat is cute
```

If you omit the `g` at the end, only the first occurrence on each line
is replaced:

```
$ echo "I love dogs! My dog is cute
> Another dog line" | sed 's/dog/cat/'
I love cats! My dog is cute
Another cat line
```

### Regular expressions

Plain text substitution works fine in educational examples, but it may
fail in real-world use cases:

```
$ echo "Dogs are cool. My dog is called Doge." | sed 's/dog/cat/g'
Dogs are cool. My cat is called Doge.
```

Luckily, regular expressions come to the rescue! The first part of
a substitution command can be a (basic) regular expression. Most
versions of `sed` also support extended regular expressions via
the `-E` or `-r` options, though this is not mandated by
[POSIX](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html).
Check your local manual page, and see also the section **BSD sed vs GNU sed**
below. For more info on regular expressions,
see [part 0](../2023-06-16-regex) of this series.

Back to our example. We can use:

```
$ echo "Dogs are cool. My dog is called Doge." | sed 's/[Dd]og/cat/g'
cats are cool. My cat is called cate.
```

Ok, we had one problem and we solved it. Now we have two problems.

One problem is that the name of the dog was also changed, as it contains the
word "Dog". This can be fixed by using a more complicated regular
expression that matches word boundaries. With GNU `sed` (the default
in most Linux distros) the regular expression that matches dog or
Dog only when it is a word is `\b[Dd]og\b`, while on most BSD systems
it is `[[:<:]][Dd]og[[:>:]]`. As far as I know, neither of these is
mandated by POSIX; avoid them if you are writing portable shell
scripts.

The second problem is that the replacement text does not respect
the replaced text's capitalization. One simple way to solve this
is using multiple commands.
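
Before moving on, here is a quick sketch of the word-boundary fix for the
first problem. It assumes GNU `sed`, since `\b` is a GNU extension; on
BSD you would swap in `[[:<:]]` and `[[:>:]]` as described above.

```sh
# Assumes GNU sed: \b is a GNU extension, not POSIX.
out=$(echo "Dogs are cool. My dog is called Doge." | sed 's/\b[Dd]og\b/cat/g')
echo "$out"
# Dogs are cool. My cat is called Doge.
```

Doge keeps its name, but note that the plural "Dogs" is now left alone
too, because the `s` right after "Dog" means there is no word boundary
there.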

### Multiple commands

A `sed` command can be a composition of multiple commands. This is
true not only for `s`, but also for all other commands that we have
not seen yet.

Commands are concatenated with a semi-colon. For example:

```
$ echo "Dogs are great, I love dogs!" | sed 's/dog/cat/g ; s/Dog/Cat/g'
Cats are great, I love cats!
```

Concatenated commands are applied, in the order they appear, to
every line. Beware that subsequent commands operate on the modified
line! For example:

```
$ echo "dogs and cats" | sed 's/dog/cat/g ; s/cat/dog/g'
dogs and dogs
```

There are other ways of giving `sed` multiple commands to execute
for each line. Similarly to `grep`, you can use `-e COMMAND -e ...`
to list more commands directly, or `-f FILE` to let sed read the
commands from a file.

### Little trick: change the separator to avoid escaping slashes

For the `s` command, the slash `/` is a special character; if you
want to use it in your regular expression or in your substitution
text, you need to escape it with a backslash `\`. For example, to
change all the slashes to backslashes you can use something like:

```
sed 's/\//\\/g'
```

But you don't have to use the slash as a separator - actually, you can
use any character other than a backslash or a newline. If you use a
different separator, you don't need to escape slashes - though you do
need to escape whatever separator you choose instead. For example,
to perform the same substitution as above you can use a pipe `|` as
a separator:

```
sed 's|/|\\|g'
```

A bit better, but you still need to escape backslashes.
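
Where a different separator really pays off is with file paths. A small
sketch, with paths made up purely for illustration:

```sh
# With | as the separator, the slashes in the paths need no escaping.
# (The paths are hypothetical.)
out=$(echo "/usr/local/bin/foo" | sed 's|/usr/local|/opt|')
echo "$out"
# /opt/bin/foo

# The same substitution with the default separator, for comparison.
same=$(echo "/usr/local/bin/foo" | sed 's/\/usr\/local/\/opt/')
```

Both commands do the same thing; the first one is just much easier to
read.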

### Addresses

In general, `sed` commands have the following form:

```
[address[,address]]function[arguments]
```

Addresses specify the range of lines of the text on which the given
function is applied. If no address is given, the command is applied
to all lines. With only one address, the command applies only to the
line (or lines) that the address selects. An address can be a line
number, a dollar sign `$`, matching the last line, or a regular
expression surrounded by slashes (e.g. `/re/`), matching all the
lines that match the expression.

Does this remind you of something? It should, if you have read my
[post on `ed`, the standard editor](../2022-12-24-ed). Addresses in
`sed` work in the same way, so I will cut it short here.

As an example, a few days ago I wanted to add a tab to every line
of a snippet of code, except for the first one. I used this:

```
$ sed '2,$ s/^/TAB/'
```

With a literal tab character (typed by pressing `Ctrl+V` followed by
`Tab`) instead of `TAB`. With GNU `sed` one can use `\t` instead.

*(Recall that `^` means "the beginning of a line", so the command
above inserts `TAB` at the beginning of each line from the second
one to the last.)*

### More commands

With `sed`, one can do more than just find & replace. Here are
some of its other (simple) commands:

**Delete**: `d`. You can use it on a range of lines, the default being
every line. Unexpectedly useful trick: you can use `| sed 'd'` instead of
`> /dev/null` to suppress all standard output!

**Change**: `c`. The syntax is a bit different from what we have seen
so far. For example, to replace every line that ends with `0` or `5`
with `bar` you can use

```
$ sed '/[05]$/ c\
bar
'
```

Notice the newline before and after `bar`.
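
To see it in action, here is a small run of the same command on a
handful of numbers (the input is made up for illustration):

```sh
# Replace every line ending in 0 or 5 with "bar".
# printf feeds sed one number per line.
out=$(printf '%s\n' 3 4 5 6 10 11 | sed '/[05]$/ c\
bar
')
echo "$out"
# 3
# 4
# bar
# 6
# bar
# 11
```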

The `c` command also behaves a bit differently from other commands
when given a range of addresses, because it replaces the whole range
instead of operating on each addressed line one by one.

**Insert**: `i`. The syntax is the same as for the `c` command, but
the text is just inserted before the current line, without deleting it.

**Print**: `p`. Lines are printed by default, but if you use the `-n`
option they are not. Useless trick: `sed -n '/RE/p'` is equivalent to
`grep 'RE'`!

**Quit**: `q`. This can be used to terminate sed early instead of,
for example, piping its result or its input through `head`. But it is
mostly known for the meme "`head` is
[harmful](https://harmful.cat-v.org/software/), use `sed 11q` instead".

## Advanced sed

So far I have only described "simple" `sed` commands that operate line
by line. This was pretty much all I knew about `sed` before writing
this post. But then I found out that there are more advanced features,
and I think they are worth mentioning.

### Pattern space and hold space

Reading the OpenBSD manual page, right after the general description
of how `sed` works, you can read the following sentence:

```
Some of the functions use a hold space to save all or part of the pattern
space for subsequent retrieval.
```

So, let's see how this *hold space* works.

There are 5 commands that manipulate or otherwise use the hold space:
`g`, `G`, `h`, `H` and `x`. The command `g` replaces the contents of the
pattern space with that of the hold space, while `G` appends the hold
space to the pattern space (with a newline character in between). The
commands `h` and `H` do the same, but in the other direction (pattern
space to hold space); you can memorize the letters as the initials of
"get" and "hold". Finally, `x` swaps the contents of the two spaces.

Ok, let's see an example.
It's a bit hard for me to come up with a
concrete one because I have never used this feature, so let's try
a "puzzle example". Say you want to replace every empty line of a file
with the content of the most recent line that started with a `>` character.

For example, if you input this text:

```
> To avoid edge cases, say the first line always starts with >
This is
a paragraph

Another paragraph

> Now use this line
After this line

> Ok now this
> Actually, this

The end.
```

You want to obtain:

```
> To avoid edge cases, say the first line always starts with >
This is
a paragraph
> To avoid edge cases, say the first line always starts with >
Another paragraph
> To avoid edge cases, say the first line always starts with >
> Now use this line
After this line
> Now use this line
> Ok now this
> Actually, this
> Actually, this
The end.
```

To do this, you can use the following command:

```
$ sed '/^>/h; /^$/g'
```

As a reminder: we are using regular expressions to specify addresses;
`^` matches the beginning of a line and `$` matches the end of a line,
so `^$` matches an empty line.

Yeah, this specific example is quite useless. Do you have any better
example of use of the hold space in `sed`? Let me know!

### Branching

I'll cover this very briefly because, as with the previous part about
the hold space, I have never used it in practice.

If you are writing a longer `sed` script, you may be interested in
(conditionally) jumping to different parts of your code. To do this,
you can set a label with `: label` and branch to it with `b label`.
You can jump to a `label` conditionally, depending on whether there has
been a text substitution or not since last reading an input line, using
`t label`.
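
A small sketch of `t` used as a loop: keep deleting the innermost
parenthesized group until no substitution succeeds. The input string
and the label name `again` are made up for the example. I pass each
command with a separate `-e` because, as far as I know, not every
implementation accepts a semicolon right after a label.

```sh
# :again marks the top of the loop; "t again" jumps back to it
# whenever the s command just before it replaced something.
# In a basic regular expression, ( and ) are literal characters.
out=$(echo "a(b(c)d)e" | sed -e ':again' -e 's/([^()]*)//' -e 't again')
echo "$out"
# ae
```

The first pass removes `(c)`, the second removes `(bd)`, and on the
third pass nothing matches, so `t` does not jump and the line is printed.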

As an example: say you want to replace some text, but also add some
kind of log of your work - for example, a line of text explaining that
a replacement happened. Then you can do something like this:

```
$ sed 's/dog/cat/g; t log; b end; : log; { i\
! At least one substitution was performed in the next line:
}; : end'
```

In the code above we set two labels, `log` just before the command
that adds the log line and `end` at the end of the `sed` script. If a
substitution happens, we jump to `log`; if we do not jump to `log`,
then the next instruction makes us jump directly to the `end`. Kinda like
programming with `goto`s!

In this example I had to wrap the `i` command in curly braces `{}`,
otherwise the semicolon needed to separate it from the `: end` command
would have been treated as part of the text to be inserted.

## BSD sed vs GNU sed

To conclude this post, I would like to highlight some of the differences
between the
[GNU implementation of `sed`](https://www.gnu.org/software/sed/manual/sed.html),
which is found in most Linux distros except
[Alpine](https://alpinelinux.org) and a few others, and the BSD version
found in many
[BSD operating systems](https://en.wikipedia.org/wiki/Berkeley_Software_Distribution),
including macOS. I am not sure all the BSD versions have the same features,
but the main points discussed in this section should hold for all of them.

Those listed below are all the differences I know of. If you know
more, feel free to send me an email and I'll add them here!

### BSD sed is more minimal

In general BSD sed is more barebones, offering little more than POSIX mandates.
If something can be done with BSD `sed` it can also be done with
the GNU version, but the converse is not always true.

GNU `sed` has some extra options, some more commands and an alternative
syntax for some of the commands we have seen in this post - such as `c`
and `i`. See
[the Extended Commands section](https://www.gnu.org/software/sed/manual/sed.html#Extended-Commands)
of the GNU manual for details.

### Escape sequences

In GNU `sed` one can use escape sequences such as `\n` and `\t` not
only in regular expressions, but also in text - for example, in the
replacement part of an `s` command. In BSD `sed`, this is not possible:
one must insert literal special characters in the command - for example
by pressing `Ctrl+V` followed by `Tab`, or by breaking a command with a
newline, which is a bit ugly in my opinion.

Escape sequences can be used in regular expressions in both the GNU
and the BSD versions; see the section **Sed Regular Expressions** in the
[OpenBSD](https://man.openbsd.org/sed)
or
[FreeBSD](https://man.freebsd.org/cgi/man.cgi?query=sed&apropos=0&sektion=0&manpath=FreeBSD+14.0-RELEASE+and+Ports&arch=default&format=html)
manual pages for details.

### Regular expression special syntax

Both versions of `sed` let you choose between basic and extended regular
expressions with the `-E` (or `-r`) flag, but the GNU version offers
some extra escapes for sets of characters that are not present in BSD.

We have already seen `\b` (word boundary); others include `\w` (word characters,
i.e. letters, digits or underscores) and `\s` (whitespace). See
[the GNU manual](https://www.gnu.org/software/sed/manual/sed.html#regexp-extensions)
for a full list.

## Until next time... sort of

It took me a long time to write this, but I am personally quite happy
with the result. This is not a complete `sed` tutorial by any means,
and the set of examples is not as comprehensive as the interested reader
might like, but I think it is a decent overview.

The next post in the series is supposed to be about `awk`, but I decided
to take a small detour and talk about some other simple, special-purpose
text filtering commands, such as `tr`, `head`, `fmt` and so on. Expect
some short posts in this series before part 3 - after all, there are
[uncountably many](https://en.wikipedia.org/wiki/Uncountable_set)
numbers between 2 and 3!

*Next in the series: [tr](../2024-01-13-tr)*