# UNIX text filters, part 1 of 3: grep

After [the preliminary post on regular expressions](../2023-06-16-regex),
we are ready to begin this series on *text filters*.

This time we'll explore `grep`, the simplest kind of filter:
given a bunch of lines of text, print out only those that match a
certain criterion.

I will only describe a few basic options. All that I mention here
is POSIX-standard, with a few exceptions (such as the option `-o`)
that I point out along the way. This means
that the content of this post is valid in pretty much any UNIX-like
OS, but check your manual pages before copy-pasting my code - I can
always make mistakes.

Without further ado, let's dive in!

## Standard usage

If you are familiar with how (UNIX) programs read from standard
input and write to standard output, the idea behind `grep` is
easily explained: the command

```
$ grep PATTERN
```

will read lines from standard input and write to standard output
only those that contain the given `PATTERN`. If you specify file
names after the pattern

```
$ grep PATTERN file1 file2 ...
```

`grep` will read those files instead of standard input. The `PATTERN`
can also be a [regular expression](../2023-06-16-regex).

In other words, you can use `grep` to look for certain pieces of
text in a file or in the output of another command. If you do not
understand what all of this is about, start reading from the **Examples**
section below to get an idea.

Now let's see how you can tune `grep`'s behavior to your needs.

### What to match: `-i`, `-v`

A common use of `grep`, especially for non-programming tasks, is
to look for occurrences of a specific word in a long text. In
this case one usually does not care if the word is all lowercase
or capitalized, for example because it is at the beginning of a sentence.
If you find yourself in this situation, you can use the `-i` option
to make `grep` case-insensitive.
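
As a small sketch of the difference `-i` makes, the snippet below feeds
three made-up lines to `grep` with and without the option. The variable
`sample` and its contents are invented for this illustration.

```sh
# Hypothetical sample text: "grep" appears capitalized on the first
# line and lowercase on the second.
sample='Grep is a filter.
I use grep every day.
Nothing to see here.'

# Case-sensitive (default): only the second line matches.
printf '%s\n' "$sample" | grep grep

# Case-insensitive: the first and second lines both match.
printf '%s\n' "$sample" | grep -i grep
```

The first command prints one line, the second prints two.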

Sometimes it is easier to spell out what you *do not* want to match -
for example, say you want all non-empty lines of a given file. In
this case you can use the `-v` option to invert the behavior of
`grep`, such as:

```
$ grep -v "^$" file
```

Here `"^$"` is a regular expression that matches all lines where the
beginning of the line (in regex language, `^`) is immediately followed
by the end of the line (`$`); in other words, empty lines.

### More on patterns: `-E`, `-e`, `-F`, `-f`

Up to now I have not specified what *kind* of regular expression
`grep` uses. By default it uses basic regular expressions, but it
uses extended regular expressions if called with the `-E` option.
Equivalently, you can use the command `egrep`. If you want to
turn off regular expressions altogether, you can use `grep -F` (or
`fgrep`). Note that `egrep` and `fgrep` are obsolescent: recent
versions of GNU `grep` print a warning when you use them, so prefer
the `-E` and `-F` options in scripts.

If you want to select lines that match *any* of a number of patterns,
you can use the `-e` option:

```
$ grep -e PATTERN1 -e PATTERN2 -e ... [file1 file2 ...]
```

Alternatively you can write your patterns in a file, one per line,
and use:

```
$ grep -f PATTERN_FILE [file1 file2 ...]
```

### Grepping multiple files: `-l`, `-n`

Sometimes I use `grep` to find occurrences of a certain string in
a bunch of files, for example with

```
$ grep "word" *
```

When used with multiple input files like this, `grep` will precede
each output line with the name of the file that contains it. If the
option `-n` is used, the line number is also shown. If `-l` is used,
only the name of the file is shown, and each file is shown at most
once.

If you do not want to print the file names at all, you can always
`cat` into `grep`:

```
$ cat file1 file2 ... | grep PATTERN
```

But if anyone asks, you did not learn this from me - UUOC (Useless
Use Of Cat) is considered a crime in some circles.
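
To compare the output formats described in this section side by side,
here is a minimal sketch that works on two scratch files. The file names
`a.txt` and `b.txt` and their contents are made up for the demonstration,
and `mktemp -d` is assumed to be available (it is common, but not part
of POSIX).

```sh
# Hypothetical scratch files, created only for this demonstration.
# Remove the directory with rm -r "$dir" when you are done.
dir=$(mktemp -d)
printf 'first word\nnothing\n' > "$dir/a.txt"
printf 'nothing\nsecond word\n' > "$dir/b.txt"
cd "$dir" || exit 1

# Two input files: each matching line is prefixed with its file name.
grep word a.txt b.txt

# -n adds the line number after the file name.
grep -n word a.txt b.txt

# -l prints only the names of the files that contain a match.
grep -l word a.txt b.txt

# Through a pipe grep only sees standard input: no file names.
cat a.txt b.txt | grep word
```

The first command prints `a.txt:first word` and `b.txt:second word`, the
second one `a.txt:1:first word` and `b.txt:2:second word`, the third one
just the two file names, and the last one the two matching lines with no
prefix at all.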

*Update 2023-09-02: I have just discovered that the `-h` option can
be used to hide the file names, so no need for piping cats. However,
though present both in OpenBSD's and GNU's versions of `grep`, this
option is not POSIX standard.*

### Matching only part of a line: `-o`

You may not always want the *full line* containing a piece of text.
Sometimes you just want a specific part of a line, and you know
exactly how to match it with a regular expression. In this case you can
use the `-o` option - we'll see an example below.

The `-o` option is not POSIX-standard. It is ubiquitous though, and it
should be present in pretty much any version of `grep`.

## Examples

Now that we know the basics, let's see some exciting applications
of `grep`!

Nah, I am kidding, they are not exciting. But they are useful. Boring,
but useful.

### Filter command output

Probably my first use of `grep` was to filter out irrelevant parts of
some command's output. Say for example you are troubleshooting a
problem with your webcam: you can use `dmesg` to check what your
operating system knows about it, but most of the output is useless
to your specific problem. No worries, you can pipe `dmesg` into
`grep`:

```
$ dmesg | grep video
acpivideo0 at acpi0: VGA_
acpivout0 at acpivideo0: LCDD
uvideo0 at uhub0 port 6 configuration 1 interface 0 "JMICRON TECHNOLOGIES CO., LTD. USB2.0 UVC VGA WebCam" rev 2.00/2.04 addr 2
video0 at uvideo0
```

### Look stuff up in files

Sometimes you may want to search for something in a bunch of files.
Let's say for example I want to check in which of my old blog posts
I have mentioned "Linux":

```
$ grep -l Linux src/blog/*/*
src/blog/2022-05-29-man/man.md
src/blog/2022-08-14-website/website.md
src/blog/2022-09-10-netbooks/netbooks.md
src/blog/2023-01-28-windows-desktop/windows-desktop.md
src/blog/2023-02-25-job-control/job-control.md
src/blog/2023-02-25-job-control/jobs-diagram.pdf
```

Or say I am working on one of my software projects, and I do not remember where
a certain function is defined:

```
$ grep -n "^apply_move(" src/*.c
src/moves.c:206:apply_move(Move m, Cube cube)
```

*Note: the command above works because, when I write C code, I write
function names on a new line. See also
[this older post](../2022-06-12-shell-ide-sed) for another example
that takes advantage of this, this time using `sed`.*

### Grepping URLs

Looking for URLs in a piece of text is a common enough operation
for me that I saved it into a [script](https://git.tronto.net/scripts)
for ease of use, which I called `urlgrep`. URLs can be complicated,
so for a long time I used a regular expression copied from somewhere
on the internet.

Now that I am more familiar with `grep` and regular expressions, I have
written my own - it does not work perfectly, but at least I understand it
and I can keep tweaking it if I find errors.

Let's build it together! What does a URL look like? It usually starts with
either a *protocol* followed by a colon, or with `www.`. Then a bunch of
valid characters follow.
There are probably more rules to it, but to keep
it simple we can start like this (using *extended* regular expressions):

```
regex="(($protocols):|www\.)[$valid_chars]+"
```

For protocols we can use

```
protocols='http|https|ftp|sftp|gemini|mailto'
```

I have thrown `mailto` in there because it is quite common in links on web
pages. The valid characters are:

```
valid_chars="][a-zA-Z0-9_~/?#@!$&'()*+=.,;:-"
```

(Yes, these ones I actually copied from somewhere online). Finally we can
find all URLs with

```
$ grep -E -o "$regex"
```

As I mentioned above there are some problems with this. For instance,
if a URL is not terminated by a space, the characters following it
may be grepped too:

```
$ urlgrep <src/blog/2022-05-21-blogs/blogs.md
https://en.wikipedia.org/wiki/Hypertext).
https://caseymuratori.com/blog_0031)
https://en.wikipedia.org/wiki/Netbook)
https://developer.mozilla.org/en-US/Learn)
https://www.romanzolotarev.com/website.html).
```

This is not *technically* a problem, because parentheses and dots are allowed
as part of a URL. But it is *practically* a problem, because most URLs will
only contain matching pairs of parentheses.

## Conclusion

`grep` is a must-know for anyone who wants to be proficient with the
UNIX command line. Luckily, it is also pretty easy to learn.

Moreover, being familiar with `grep` makes it easy to learn more
advanced tools, such as `sed` and `awk`: the "read one line, process
it, print something" idea is common to all three of them.

Stay tuned for part 2: `sed`!

*Next in the series: [sed](../2023-12-03-sed)*