# UNIX text filters, part 2 of 3: sed

*This post is part of a [series](../../series)*

After the first (or second, depending on how you prefer to call ordinals
in a 0-based system) episode on [`grep`](../2023-08-20-grep) we are ready
to look at `sed`, the *stream* editor!

You can think of `sed` as the weird cousin of [`ed`](../2022-12-24-ed),
the standard editor, as they share much of their syntax. You could
argue that `ed` is the weirder one, though.

On the other hand, the *stream* part of `sed` is very peculiar,
and I prefer to think about it as a sort of `grep` that can not
only pick the desired lines, but also edit them. You can decide
which point of view you prefer after reading this post!

## Basic usage

The way sed works is easy to summarize: text is read from standard input
(or from a given file) line by line, a command is applied to each line,
and the output is printed. Pretty much the same as for `grep`, except
for the *a command is applied* part. Therefore, the power of `sed`
comes from the available commands.

A typical sed command is run like this:

```
$ sed [options] 'command' [file ...]
```

Instead of diving into the formal definition of the
grammar of sed, or following the
[manual page](https://man.openbsd.org/sed),
let's start with the basics.

### Replacing text: the `s` command

Most times I use `sed`, and pretty much every time I use it
in an interactive shell, I just use the *substitution command* `s`.
If you have used `sed` in the past, chances are you have used `s`.

As a basic example, say you want to replace all occurrences of the word
"dog" with the word "cat". Then you can use `sed 's/dog/cat/g'`:

```
$ echo "I love dogs! My dog is cute" | sed 's/dog/cat/g'
I love cats! My cat is cute
```

If you omit the `g` at the end, only the first occurrence on each line
is replaced:

```
$ echo "I love dogs! My dog is cute
> Another dog line" | sed 's/dog/cat/'
I love cats! My dog is cute
Another cat line
```

### Regular expressions

Plain text substitution works fine in educational examples, but it may
fail in real-world use cases:

```
$ echo "Dogs are cool. My dog is called Doge." | sed 's/dog/cat/g'
Dogs are cool. My cat is called Doge.
```

Luckily, regular expressions come to the rescue! The first part of
a substitution command can be a (basic) regular expression. Most
versions of `sed` also support extended regular expressions via
the `-E` or `-r` options, though this is not mandated by
[POSIX](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html).
Check your local manual page, and see also the section **BSD sed vs GNU sed**
below. For more info on regular expressions,
see [part 0](../2023-06-16-regex) of this series.
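
To make the difference between the two flavors concrete, here is a small sketch that swaps two words, once with a basic and once with an extended regular expression. It assumes that your `sed` accepts `-E`; the variable names `bre` and `ere` are just for illustration.

```shell
# Basic regular expressions: groups are written \( \), and there is no +,
# so "one or more a" is spelled aa*.
bre=$(echo "aaa bbb" | sed 's/\(aa*\) \(bb*\)/\2 \1/')

# Extended regular expressions (assumes -E is supported): plain ( ) and +.
ere=$(echo "aaa bbb" | sed -E 's/(a+) (b+)/\2 \1/')

echo "$bre"   # bbb aaa
echo "$ere"   # bbb aaa
```

In both cases `\1` and `\2` in the replacement refer to the text matched by the first and second group.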

Back to our example. We can use:

```
$ echo "Dogs are cool. My dog is called Doge." | sed 's/[Dd]og/cat/g'
cats are cool. My cat is called cate.
```

Ok, we had one problem and we solved it. Now we have two problems.

One problem is that the name of the dog was also changed, as it contains the
word "Dog". This can be fixed by using a more complicated regular
expression that matches word boundaries. With GNU `sed` (the default
in most Linux distros) the regular expression that matches dog or
Dog only when it is a word is `\b[Dd]og\b`, while on most BSD systems
it is `[[:<:]][Dd]og[[:>:]]`. As far as I know, none of these is
mandated by POSIX; avoid them if you are writing portable shell
scripts.
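
As a sketch of the word-boundary fix, assuming GNU `sed` (this will not work as written on BSD), the command below also keeps an optional plural "s" by capturing it in a group. The group and the variable name `out` are my additions for the sake of the example.

```shell
# GNU sed only: \b matches a word boundary.
# \(s\{0,1\}\) captures an optional trailing "s" so that plurals survive.
out=$(echo "Dogs are cool. My dog is called Doge." \
    | sed 's/\b[Dd]og\(s\{0,1\}\)\b/cat\1/g')

echo "$out"   # cats are cool. My cat is called Doge.
```

The name "Doge" is left alone, because there is no word boundary between the "g" and the "e".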

The second problem is that the replacement text does not respect
the replaced text's capitalization. One simple way to solve this
is using multiple commands.

### Multiple commands

A `sed` command can be a composition of multiple commands. This is
true not only for `s`, but also for all other commands that we have
not seen yet.

Commands are concatenated with a semicolon. For example:

```
$ echo "Dogs are great, I love dogs!" | sed 's/dog/cat/g ; s/Dog/Cat/g'
Cats are great, I love cats!
```

Concatenated commands are applied, in the order they appear, to
every line.  Beware that subsequent commands operate on the modified
line! For example:

```
$ echo "dogs and cats" | sed 's/dog/cat/g ; s/cat/dog/g'
dogs and dogs
```

There are other ways of giving `sed` multiple commands to execute
for each line. Similarly to `grep`, you can use `-e COMMAND -e ...`
to list more commands directly, or `-f FILE` to let sed read the
commands from a file.
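
Here is a minimal sketch of both alternatives, applied to the same input as above. The temporary script file is created with `mktemp`, which I am assuming is available on your system; the variable names are arbitrary.

```shell
# Two commands given with -e: they run in order, like with a semicolon.
out_e=$(echo "Dogs are great, I love dogs!" \
    | sed -e 's/dog/cat/g' -e 's/Dog/Cat/g')

# The same two commands read from a script file, one per line.
script=$(mktemp)
printf '%s\n' 's/dog/cat/g' 's/Dog/Cat/g' > "$script"
out_f=$(echo "Dogs are great, I love dogs!" | sed -f "$script")
rm -f "$script"

echo "$out_e"   # Cats are great, I love cats!
echo "$out_f"   # Cats are great, I love cats!
```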

### Little trick: change the separator to avoid escaping slashes

For the `s` command, the slash `/` is a special character; if you
want to use it in your regular expression or in your substitution
text, you need to escape it with a backslash `\`. For example, to
change all the slashes to backslashes you can use something like:

```
sed 's/\//\\/g'
```

But you don't have to use the slash as a separator - actually, you can
use any character other than a backslash or a newline. If you use a
different separator, you don't need to escape slashes - though you do
need to escape whatever separator you choose instead.  For example,
to perform the same substitution as above you can use a pipe `|` as
a separator:

```
sed 's|/|\\|g'
```

A bit better, but you still need to escape backslashes.
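
The trick pays off most when working with file paths. As a small illustration (the paths are made up for the example), compare the two equivalent commands below:

```shell
# With | as separator, no slash needs to be escaped.
out=$(echo "/usr/local/bin/sed" | sed 's|/usr/local|/opt|')

# The same substitution with the default separator is much noisier.
same=$(echo "/usr/local/bin/sed" | sed 's/\/usr\/local/\/opt/')

echo "$out"    # /opt/bin/sed
echo "$same"   # /opt/bin/sed
```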

### Addresses

In general, `sed` commands have the following form:

```
[address[,address]]function[arguments]
```

Addresses specify the range of lines of the text on which the given
function is applied. If no address is given, the command is applied
to all lines. With only one address the command applies only to the
lines matching that address. An address can be a line number, a dollar
sign `$`, matching the last line, or a regular expression surrounded
by slashes (e.g. `/re/`), matching all the lines that match the expression.

Does this remind you of something? It should, if you have read my
[post on `ed`, the standard editor](../2022-12-24-ed). Addresses in
`sed` work in the same way, so I will cut it short here.

As an example, a few days ago I wanted to add a tab to every line
of a snippet of code, except for the first one. I used this:

```
$ sed '2,$ s/^/TAB/'
```

With a literal tab character (by pressing `Ctrl+V Ctrl+TAB`) instead
of `TAB`. With GNU `sed` one can use `\t` instead.

*(Recall that `^` means "the beginning of a line", so the command
above inserts `TAB` at the beginning of each line from the second
one to the last.)*
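
The address forms described in this section can be sketched on a four-line input as follows. The input and the variable names are made up for the example; each command marks the addressed lines with a leading `> `.

```shell
input=$(printf '%s\n' one two three four)

# A single line number: only line 2 is changed.
single=$(printf '%s\n' "$input" | sed '2 s/^/> /')

# The dollar sign: only the last line is changed.
last=$(printf '%s\n' "$input" | sed '$ s/^/> /')

# A regular expression: every line starting with "t" is changed.
regex=$(printf '%s\n' "$input" | sed '/^t/ s/^/> /')

# A range: from the first line matching /two/ to the last line.
range=$(printf '%s\n' "$input" | sed '/two/,$ s/^/> /')

printf '%s\n' "$range"
# one
# > two
# > three
# > four
```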

### More commands

With `sed`, one can do more than just find & replace. Here are
some of its other (simple) commands:

**Delete**: `d`. You can use it on a range of lines, the default being
every line. Unexpectedly useful trick: you can use `| sed 'd'` instead of
`> /dev/null` to suppress all standard output!
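
A more down-to-earth sketch of `d`, on a made-up input: delete comment lines and empty lines, keeping everything else.

```shell
input=$(printf '%s\n' '# a comment' 'first' '' 'second')

# Delete lines starting with # and lines that are empty.
out=$(printf '%s\n' "$input" | sed '/^#/d; /^$/d')

printf '%s\n' "$out"
# first
# second
```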

**Change**: `c`. The syntax is a bit different from what we have seen
so far. For example, to replace every line that ends with `0` or `5`
with `bar` you can use

```
$ sed '/[05]$/ c\
bar
'
```

Notice the newline before and after `bar`.

The `c` command also behaves a bit differently from other commands
when given a range of addresses, because it replaces the whole range
instead of operating on each addressed line one by one.
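
To see the range behavior in practice, here is a small sketch on a made-up four-line input. I have checked this behavior against GNU `sed`; I am assuming other implementations agree.

```shell
# Lines 2 and 3 are replaced by a single copy of the new text.
out=$(printf '%s\n' one two three four | sed '2,3 c\
REPLACED
')

printf '%s\n' "$out"
# one
# REPLACED
# four
```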

**Insert**: `i`. The syntax is the same as for the `c` command, but
text is just inserted, without deleting the current line.
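
For example, this sketch inserts a line before the second line of a made-up two-line input:

```shell
# The text is printed right before the addressed line, which is kept.
out=$(printf '%s\n' one two | sed '2 i\
inserted
')

printf '%s\n' "$out"
# one
# inserted
# two
```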

**Print**: `p`. Lines are printed by default, but if you use the `-n`
option they are not. Useless trick: `sed -n '/RE/p'` is equivalent to
`grep 'RE'`!
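
A quick sketch of the useless trick, on a made-up input. It also shows what happens if you forget `-n`: matching lines are printed twice, once by `p` and once by default.

```shell
input=$(printf '%s\n' apple banana cherry)

with_sed=$(printf '%s\n' "$input" | sed -n '/an/p')
with_grep=$(printf '%s\n' "$input" | grep 'an')

# Without -n, the matching line appears twice in the output.
doubled=$(printf '%s\n' "$input" | sed '/an/p')

echo "$with_sed"    # banana
echo "$with_grep"   # banana
```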

**Quit**: `q`. This can be used to terminate sed earlier instead of,
for example, piping its result or its input through `head`. But it is
mostly known for the meme "`head` is
[harmful](https://harmful.cat-v.org/software/), use `sed 11q` instead".
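
As a sketch of how `q` compares to `head`, on a made-up input of five lines. Unlike `head`, `q` also accepts a regular expression as address, so you can stop at the first line that matches.

```shell
# Quit after printing the second line, like head -n 2.
first_two=$(printf '%s\n' 1 2 3 4 5 | sed 2q)
with_head=$(printf '%s\n' 1 2 3 4 5 | head -n 2)

# Quit after printing the first line that matches /3/.
until_three=$(printf '%s\n' 1 2 3 4 5 | sed '/3/q')

printf '%s\n' "$until_three"
# 1
# 2
# 3
```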

## Advanced sed

So far I have only described "simple" `sed` commands that operate line
by line. This was pretty much all I knew about `sed` before writing
this post. But then I found out that there are more advanced features,
and I think they are worth mentioning.

### Pattern space and hold space

Reading the OpenBSD manual page, right after the general description
of how `sed` works, you can read the following sentence:

```
Some of the functions use a hold space to save all or part of the pattern
space for subsequent retrieval.
```

So, let's see how this *hold space* works.

There are 5 commands that manipulate or otherwise use the hold space:
`g`, `G`, `h`, `H` and `x`. The command `g` replaces the contents of the
pattern space with that of the hold space, while `G` appends the hold
space to the pattern space (with a newline character in between). The
commands `h` and `H` do the same, but in the other direction (pattern
space to hold space); you can memorize them as the initials of "hold"
and "get".  Finally, `x` swaps the contents of the two spaces.

Ok, let's see an example. It's a bit hard for me to come up with a
concrete one because I have never used this feature, so let's try
a "puzzle example". Say you want to replace every empty line of a file
with the content of the last line that started with a `>` character.

For example, if you input this text:

```
> To avoid edge cases, say the first line always starts with >
This is
a paragraph

Another paragraph

> Now use this line
After this line

> Ok now this
> Actually, this

The end.
```

You want to obtain:

```
> To avoid edge cases, say the first line always starts with >
This is
a paragraph
> To avoid edge cases, say the first line always starts with >
Another paragraph
> To avoid edge cases, say the first line always starts with >
> Now use this line
After this line
> Now use this line
> Ok now this
> Actually, this
> Actually, this
The end.
```

To do this, you can use the following command:

```
$ sed '/^>/h; /^$/g'
```

As a reminder: we are using regular expressions to specify addresses;
`^` matches the beginning of a line and `$` matches the end of a line,
so `^$` matches an empty line.
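
To check the puzzle command without a file, here is a sketch that feeds it a shorter, made-up input:

```shell
# h saves each line starting with > in the hold space;
# g replaces each empty line with the saved line.
out=$(printf '%s\n' '> first' 'text' '' '> second' '' 'end' \
    | sed '/^>/h; /^$/g')

printf '%s\n' "$out"
# > first
# text
# > first
# > second
# > second
# end
```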

Yeah, this specific example is quite useless. Do you have any better
example of use of the hold space in `sed`? Let me know!
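
One commonly cited use of the hold space is printing the lines of the input in reverse order, similarly to what the `tac` command does on systems that have it. This is only a sketch on a made-up input: it keeps the whole input in memory, so it is not something to run on huge files.

```shell
# 1!G  on every line but the first, append the hold space (the lines
#      seen so far, already reversed) to the current line
# h    save the result back to the hold space
# $p   on the last line, print everything (-n suppresses default output)
reversed=$(printf '%s\n' one two three | sed -n '1!G; h; $p')

printf '%s\n' "$reversed"
# three
# two
# one
```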

### Branching

I'll cover this very briefly because, like for the previous part about
the hold space, I have never used it in practice.

If you are writing a longer `sed` script, you may be interested in
(conditionally) jumping to different parts of your code. To do this,
you can set a label with `: label` and branch to it with `b label`.
You can jump to a `label` conditionally, depending on whether there has
been a text substitution or not since last reading an input line, using
`t label`.

As an example: say you want to replace some text, but also add some
kind of log of your work - for example, a line of text explaining that
a replacement happened. Then you can do something like this:

```
$ sed 's/dog/cat/g; t log; b end; : log; { i\
! At least one substitution was performed in the next line:
}; : end'
```

In the code above we set two labels, `log` just before the command
that adds the log line and `end` at the end of the `sed` script.  If a
substitution happens, we jump to `log`; if we do not jump to `log`,
then the next instruction makes us jump directly to the `end`. Kinda like
programming with `goto`s!

In this example I had to wrap the `i` command in curly braces `{}`,
otherwise the semicolon needed to separate it from the `: end` command would
have been treated as part of the text to be inserted.
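
Branching backwards with `t` also gives you loops. As a sketch, the command below removes all parenthesized parts of a line, including nested ones, by deleting one innermost pair at a time until no substitution happens. The label name `again` and the input are made up; I pass each command with its own `-e`, on the assumption that this is the safest way to terminate label names across implementations.

```shell
# :again   set a label at the start of the script
# s/.../   remove one pair of parentheses with no parentheses inside
# t again  if a substitution happened, jump back and try once more
out=$(echo "a(b(c)d)e" \
    | sed -e ':again' -e 's/([^()]*)//' -e 't again')

echo "$out"   # ae
```

The first pass removes `(c)`, the second removes `(bd)`, and the third finds nothing to replace, so the loop ends.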

## BSD sed vs GNU sed

To conclude this post, I would like to highlight some of the differences
between the
[GNU implementation of `sed`](https://www.gnu.org/software/sed/manual/sed.html),
which is found in most Linux distros except
[Alpine](https://alpinelinux.org) and a few others, and the BSD version
found in many
[BSD operating systems](https://en.wikipedia.org/wiki/Berkeley_Software_Distribution),
including macOS. I am not sure all the BSD versions have the same features,
but the main points discussed in this section should hold for all of them.

Those listed below are all the differences I know of.  If you know
more, feel free to send me an email and I'll add them here!

### BSD sed is more minimal

In general BSD sed is more barebones, offering little more than POSIX mandates.
If something can be done with BSD `sed` it can also be done with
the GNU version, but the converse is not always true.

GNU `sed` has some extra options, some more commands and an alternative
syntax for some of the commands we have seen in this post - such as `c`
and `i`. See
[the Extended Commands section](https://www.gnu.org/software/sed/manual/sed.html#Extended-Commands)
of the GNU manual for details.

### Escape sequences

In GNU `sed` one can use escape sequences such as `\n` and `\t` not
only in regular expressions, but also in text - for example, in the
replacement part of an `s` command. In BSD `sed`, this is not possible:
one must insert literal special characters in their command - for example
by pressing `Ctrl+V Ctrl+TAB` or by breaking a command with a newline,
which is a bit ugly in my opinion.
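
In a script, one way to get the literal characters without typing them by hand is to let the shell produce them. The sketch below should work with both versions; the variable names are arbitrary.

```shell
# Let printf produce a literal tab, then use it in the replacement text.
# Double quotes are needed so that the shell expands $tab.
tab=$(printf '\t')
out=$(echo "key value" | sed "s/ /$tab/")

# A newline in the replacement text: a backslash followed by a real newline.
lines=$(echo "a,b" | sed 's/,/\
/')

printf '%s\n' "$lines"
# a
# b
```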

Escape sequences can be used in regular expressions in both the GNU
and in the BSD version, see the section **Sed Regular Expressions** in the
[OpenBSD](https://man.openbsd.org/sed)
or
[FreeBSD](https://man.freebsd.org/cgi/man.cgi?query=sed&apropos=0&sektion=0&manpath=FreeBSD+14.0-RELEASE+and+Ports&arch=default&format=html)
manual pages for details.

### Regular expression special syntax

Both versions of `sed` let you choose between basic and extended regular
expressions with the `-E` (or `-r`) flag, but the GNU version offers
some new sets of characters not present in BSD.

We have already seen `\b` (word boundary); others include `\w` (word characters,
i.e. letters, digits or underscores) and `\s` (whitespace). See
[the GNU manual](https://www.gnu.org/software/sed/manual/sed.html#regexp-extensions)
for a full list.
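
If you need a script to work with both versions, the POSIX character classes are a reasonable substitute for some of these: `[[:space:]]` for `\s` and `[[:alnum:]_]` for `\w`. A small sketch on made-up inputs:

```shell
# Portable replacement for \s: squeeze runs of whitespace into one space.
squeezed=$(echo "too    many   spaces" \
    | sed 's/[[:space:]][[:space:]]*/ /g')

# Portable replacement for \w: mask every word character with an x.
masked=$(echo "user_1 logged in" | sed 's/[[:alnum:]_]/x/g')

echo "$squeezed"   # too many spaces
echo "$masked"     # xxxxxx xxxxxx xx
```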

## Until next time... sort of

It took me a long time to write this, but I am personally quite happy
with the result.  This is not a complete `sed` tutorial by any means,
and the set of examples is not as comprehensive as the interested reader
might like, but I think it is a decent overview.

The next post in the series is supposed to be about `awk`, but I decided
to take a small detour and talk about some other simple, special-purpose
text filtering commands, such as `tr`, `head`, `fmt` and so on. Expect
some short posts in this series before part 3 - after all, there are
[uncountably many](https://en.wikipedia.org/wiki/Uncountable_set)
numbers between 2 and 3!

*Next in the series: [tr](../2024-01-13-tr)*