# UNIX text filters, part 1 of 3: grep

After [the preliminary post on regular expressions](../2023-06-16-regex),
we are ready to begin this series on *text filters*.

This time we'll explore `grep`, the simplest kind of filter:
given a bunch of lines of text, print out only those that match a
certain criterion.

I will only describe a few basic options. Everything that I mention here
is POSIX-standard, with the exception of the options `-o` and `-h`. This means
that the content of this post is valid in pretty much any UNIX-like
OS, but check your manual pages before copy-pasting my code - I can
always make mistakes.

Without further ado, let's dive in!

## Standard usage

If you are familiar with how (UNIX) programs read from standard
input and write to standard output, the idea behind `grep` is
easily explained: the command

```
$ grep PATTERN
```

will read lines from standard input and write to standard output
only those that contain the given `PATTERN`. If you specify file
names after the pattern

```
$ grep PATTERN file1 file2 ...
```

`grep` will read those files instead of standard input. The `PATTERN`
can also be a [regular expression](../2023-06-16-regex).
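
As a minimal sketch of the two invocations above, the following script greps the same text once from standard input and once from a file. The sample file and its contents are made up for illustration, and `mktemp` is assumed to be available (it is not POSIX, but it is very common).

```shell
# Create a throwaway sample file; the contents are hypothetical.
tmp=$(mktemp)
printf 'apple pie\nbanana bread\napple juice\n' > "$tmp"

# Read the lines from standard input...
from_stdin=$(grep apple < "$tmp")

# ...or let grep open the file itself.
from_file=$(grep apple "$tmp")

# Both variables now hold the same two matching lines.
printf '%s\n' "$from_stdin"
rm -f "$tmp"
```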

In other words, you can use `grep` to look for certain pieces of
text in a file or in the output of another command. If you do not
understand what all of this is about, start reading from the **Examples**
section below to get an idea.

Now let's see how you can tune `grep`'s behavior to your needs.

### What to match: `-i`, `-v`

A common use of `grep`, especially for non-programming tasks, is
to look for occurrences of a specific word in a long text. In
this case one usually does not care if the word is all lowercase
or capitalized, for example because it appears at the beginning of
a sentence. If you find yourself in this situation, you can use the
`-i` option to make `grep` case-insensitive.

Sometimes it is easier to spell out what you *do not* want to match -
for example, say you want all non-empty lines of a given file. In
this case you can use the `-v` option to invert the behavior of
`grep`, like this:

```
$ grep -v "^$" file
```

Here `"^$"` is a regular expression that matches all lines where the
beginning of the line (in regex language, `^`) is immediately followed
by the end of the line (`$`); in other words, empty lines.
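
To see both options at work, here is a small sketch on a made-up piece of text (the sample lines are my own assumption, not taken from a real file):

```shell
# Hypothetical sample text, including one empty line.
text=$(printf 'Grep is handy\n\nI use grep daily\nsed comes next\n')

# -i: match "grep" regardless of case.
insens=$(printf '%s\n' "$text" | grep -i grep)

# -v: keep only the lines that are NOT empty.
nonempty=$(printf '%s\n' "$text" | grep -v '^$')

printf '%s\n' "$insens"
printf '%s\n' "$nonempty"
```

The first command keeps both the capitalized and the lowercase "grep" line; the second drops only the empty line.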

### More on patterns: `-E`, `-e`, `-F`, `-f`

Up to now I have not specified what *kind* of regular expression
`grep` uses. By default it uses basic regular expressions, but it
uses extended regular expressions if called with the `-E` option.
Equivalently, you can use the command `egrep`.  If you want to
turn off regular expressions altogether and match fixed strings,
you can use `grep -F` (or `fgrep`). The `-E` and `-F` spellings are
the portable ones: `egrep` and `fgrep` are not part of current POSIX,
and recent GNU versions print an obsolescence warning when you run them.
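
The difference between the three modes is easiest to see on characters that are special in one mode but not in another. A minimal sketch, using three made-up input lines:

```shell
# Three hypothetical sample lines.
lines=$(printf 'aab\na+b\na.b\n')

# Basic regex: "+" is an ordinary character, so only "a+b" matches.
basic=$(printf '%s\n' "$lines" | grep 'a+b')

# Extended regex: "+" means "one or more", so only "aab" matches.
extended=$(printf '%s\n' "$lines" | grep -E 'a+b')

# Fixed string: "." is not a wildcard, so only "a.b" matches.
fixed=$(printf '%s\n' "$lines" | grep -F 'a.b')

printf 'basic: %s\nextended: %s\nfixed: %s\n' "$basic" "$extended" "$fixed"
```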

If you want to select lines that match *any* of a number of patterns,
you can use the `-e` option:

```
$ grep -e PATTERN1 -e PATTERN2 -e ... [file1 file2 ...]
```

Alternatively you can write your patterns in a file, one per line,
and use:

```
$ grep -f PATTERN_FILE [file1 file2 ...]
```
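
The two forms should select the same lines. Here is a small sketch that checks this on made-up data (file names and contents are hypothetical, and `mktemp` is assumed to be available):

```shell
# Hypothetical input file and pattern file.
input=$(mktemp)
patterns=$(mktemp)
printf 'red apple\ngreen pear\nyellow banana\n' > "$input"
printf 'apple\nbanana\n' > "$patterns"

# Two patterns given on the command line...
with_e=$(grep -e apple -e banana "$input")

# ...or the same two patterns read from a file, one per line.
with_f=$(grep -f "$patterns" "$input")

printf '%s\n' "$with_f"
rm -f "$input" "$patterns"
```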

### Grepping multiple files: `-l`, `-n`

Sometimes I use `grep` to find occurrences of a certain string in
a bunch of files, for example with

```
$ grep "word" *
```

When used with multiple input files like this, `grep` will precede
each output line with the name of the file that contains it. If the
option `-n` is used, the line number is also shown. If `-l` is used,
only the name of the file is shown, and each file is shown at most
once.

If you do not want to print the file names at all, you can always
`cat` into `grep`:

```
$ cat file1 file2 ... | grep PATTERN
```

But if anyone asks, you did not learn this from me - UUOC (Useless
Use Of Cat) is considered a crime in some circles.

*Update 2023-09-02: I have just discovered that the `-h` option can
be used to hide the file names, so no need for piping cats. However,
though present both in OpenBSD's and GNU's versions of `grep`, this
option is not POSIX standard.*
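
To compare the output formats side by side, here is a sketch that runs the variants on two made-up files in a temporary directory. The file names are hypothetical, `mktemp -d` is assumed to be available, and `-h` only works where your `grep` supports it.

```shell
# Work in a throwaway directory with two hypothetical files.
dir=$(mktemp -d)
printf 'one word\nnothing here\n' > "$dir/a.txt"
printf 'nothing\nanother word\n' > "$dir/b.txt"
cd "$dir" || exit 1

plain=$(grep word a.txt b.txt)        # file name, then the line
numbered=$(grep -n word a.txt b.txt)  # file name, line number, then the line
names=$(grep -l word a.txt b.txt)     # file names only
bare=$(grep -h word a.txt b.txt)      # lines only (-h is not POSIX)

printf '%s\n' "$plain" "$numbered" "$names" "$bare"
cd - > /dev/null
rm -rf "$dir"
```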

### Matching only part of a line: `-o`

You may not always want the *full line* containing a piece of text.
Sometimes you just want a specific part of a line, and you know
exactly how to match it with a regular expression. In this case you can
use the `-o` option, which prints only the matched parts, one match
per output line - we'll see an example below.

The `-o` option is not POSIX-standard. It is ubiquitous though, and it
should be present in pretty much any version of `grep`.
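
As a quick illustration, this sketch pulls the numbers out of a made-up line of text, assuming a `grep` that supports `-o`:

```shell
# Hypothetical input: print only the runs of digits, one match per line.
numbers=$(printf 'room 12, floor 3\nno digits here\n' | grep -o '[0-9][0-9]*')
printf '%s\n' "$numbers"
```

Note that a line with two matches produces two output lines, and a line with no match produces none.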

## Examples

Now that we know the basics, let's see some exciting applications
of `grep`!

Nah, I am kidding, they are not exciting. But they are useful. Boring,
but useful.

### Filter command output

Probably my first use of `grep` was to filter out irrelevant parts of
some command's output. Say for example you are troubleshooting a
problem with your webcam: you can use `dmesg` to check what your
operating system knows about it, but most of the output is irrelevant
to your specific problem.  No worries, you can pipe `dmesg` into
`grep`:

```
$ dmesg | grep video
acpivideo0 at acpi0: VGA_
acpivout0 at acpivideo0: LCDD
uvideo0 at uhub0 port 6 configuration 1 interface 0 "JMICRON TECHNOLOGIES CO., LTD. USB2.0 UVC VGA WebCam" rev 2.00/2.04 addr 2
video0 at uvideo0
```

### Look stuff up in files

Sometimes you may want to search for something in a bunch of files.
Let's say for example I want to check in which of my old blog posts
I have mentioned "Linux":

```
$ grep -l Linux src/blog/*/*
src/blog/2022-05-29-man/man.md
src/blog/2022-08-14-website/website.md
src/blog/2022-09-10-netbooks/netbooks.md
src/blog/2023-01-28-windows-desktop/windows-desktop.md
src/blog/2023-02-25-job-control/job-control.md
src/blog/2023-02-25-job-control/jobs-diagram.pdf
```

Or say I am working on one of my software projects, and I do not remember where
a certain function is defined:

```
$ grep -n "^apply_move(" src/*.c
src/moves.c:206:apply_move(Move m, Cube cube)
```

*Note: the command above works because, when I write C code, I write
function names on a new line. See also
[this older post](../2022-06-12-shell-ide-sed) for another example
that takes advantage of this, this time using `sed`.*
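
To make the note concrete, here is a sketch with a made-up C file written in that style. The `solve` function and `first_move` are hypothetical names used only for this example, and `mktemp` is assumed to be available.

```shell
# Hypothetical C file with function names at the start of a line.
src=$(mktemp)
cat > "$src" <<'EOF'
void
apply_move(Move m, Cube cube)
{
}

void
solve(Cube cube)
{
    apply_move(first_move, cube);
}
EOF

# The call inside solve() is indented, so only the definition matches.
hits=$(grep -n '^apply_move(' "$src")
printf '%s\n' "$hits"
rm -f "$src"
```

With a single input file `grep` does not print the file name, so the output is just the line number and the matching line.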

### Grepping URLs

Looking for URLs in a piece of text is a common enough operation
for me that I saved it into a [script](https://git.tronto.net/scripts)
called `urlgrep`.  URLs can be complicated,
so for a long time I used a regular expression copied from somewhere
on the internet.

Now that I am more familiar with `grep` and regular expressions, I have
written my own - it does not work perfectly, but at least I understand it
and I can keep tweaking it if I find errors.

Let's build it together! What does a URL look like? It usually starts with
either a *protocol* followed by a colon, or with `www.`. Then a bunch of
valid characters follow. There are probably more rules to it, but to keep
it simple we can start like this (using *extended* regular expressions;
`$protocols` and `$valid_chars` are defined right below):

```
regex="(($protocols):|www\.)[$valid_chars]+"
```

For protocols we can use

```
protocols='http|https|ftp|sftp|gemini|mailto'
```

I have thrown `mailto` in there because it is quite common in links on web
pages. The valid characters are:

```
valid_chars="][a-zA-Z0-9_~/?#@!$&'()*+=.,;:-"
```

(Yes, these ones I actually copied from somewhere online.) Finally we can
find all URLs with

```
$ egrep -o "$regex"
```
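
Putting the pieces together, a script along these lines could look like the sketch below. This is my reconstruction from the snippets above, not necessarily the actual `urlgrep` script: I wrapped it in a shell function, used `grep -E` in place of `egrep`, and made up the sample input.

```shell
#!/bin/sh
# Sketch of a urlgrep-style filter; the real script may differ.

# The building blocks must be assigned before the regex that expands them.
protocols='http|https|ftp|sftp|gemini|mailto'
valid_chars="][a-zA-Z0-9_~/?#@!$&'()*+=.,;:-"
regex="(($protocols):|www\.)[$valid_chars]+"

# Print every URL-looking match, one per line; reads stdin or the given files.
urlgrep() {
    grep -E -o "$regex" "$@"
}

# Hypothetical sample input.
found=$(printf 'See https://example.com/page and www.example.org today.\n' | urlgrep)
printf '%s\n' "$found"
```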

As I mentioned above there are some problems with this. For instance,
if a URL is followed directly by punctuation rather than by a space,
that punctuation may be grepped too:

```
$ urlgrep <src/blog/2022-05-21-blogs/blogs.md
https://en.wikipedia.org/wiki/Hypertext).
https://caseymuratori.com/blog_0031)
https://en.wikipedia.org/wiki/Netbook)
https://developer.mozilla.org/en-US/Learn)
https://www.romanzolotarev.com/website.html).
```

This is not *technically* a problem, because parentheses and dots are allowed
as part of a URL. But it is *practically* a problem, because most URLs will
only contain matching pairs of parentheses, so the trailing `)` and `.` here
belong to the surrounding text rather than to the links.

## Conclusion

`grep` is a must-know for anyone who wants to be proficient with the
UNIX command line. Luckily, it is also pretty easy to learn.

Moreover, being familiar with `grep` makes it easy to learn more
advanced tools, such as `sed` and `awk`: the "read one line, process
it, print something" idea is common to all three of them.

Stay tuned for part 2: `sed`!

*Next in the series: [sed](../2023-12-03-sed)*