sebastiano.tronto.net

Source files and build scripts for my personal website
git clone https://git.tronto.net/sebastiano.tronto.net
Download | Log | Files | Refs | README

commit e04ede76607de08e779f675d9161d0a61239c6b2
parent 35b57b07f63804ecf07b5c471a656a646b9d460a
Author: Sebastiano Tronto <sebastiano@tronto.net>
Date:   Sun, 20 Aug 2023 14:51:04 +0200

new blog post

Diffstat:
Asrc/blog/2023-08-20-grep/grep.md | 247+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 247 insertions(+), 0 deletions(-)

diff --git a/src/blog/2023-08-20-grep/grep.md b/src/blog/2023-08-20-grep/grep.md @@ -0,0 +1,247 @@ +# Unix text filters, part 1 of 3: grep + +After [the preliminary post on regular expressions](../2023-06-16-regex), +we are ready to begin this series on *text filters*. + +This time we'll explore `grep`, the most simple kind of filter: +given a bunch of lines of text, print out only those that match a +certain criterion. + +I will only describe a few basic options. All that I mention here +is POSIX-standard, with the exception of the option `-o`. This means +that the content of this post is valid in pretty much any UNIX-like +OS, but check your manual pages before copy-pasting my code - I can +always make mistakes. + +Without further ado, let's dive in! + +## Standard usage + +If you are familiar with how (UNIX) programs read from standard +output and write to standard output, the idea behing `grep` is +easily explained: the command + +``` +$ grep PATTERN +``` + +will read lines from standard input and write to standard output +only those that contain the given `PATTERN`. If you specify file +names after the pattern + +``` +$ grep PATTERN file1 file2 ... +``` + +`grep` will read those files instead of standard input. The `PATTERN` +can also be a [regular expression](../2023-06-16-regex). + +In other words, you can use `grep` to look for certain pieces of +text in a file or in the output of another command. If you do not +understand all of this is about, start reading from the **Examples** +section below to get an idea. + +Now let's see how you can tune `grep`'s behavior to your needs. + +### What to match: `-i`, `-v` + +A common use of `grep`, especially for non-programming tasks, is +to look for occurrences of a specific word in a long text. In +this case one usually does not care if the word is all lowercase +or capitalized, for example because at the beginning of a sentence. +If you find yourself in this situation, you can use the `-i` option +to make `grep` case-insensitive. + +Sometimes it easier to spell out what you *do not* want to match - +for example, say you want all non-empty lines of a given file. In +this case you can use the `-v` option to invert the behavior of +`grep`, such as: + +``` +$ grep -v "^$" file +``` + +Here `"^$"` is a regular expression that matches all lines where the +beginning of the line (in regex language, `^`) is immediately followed +by the end of the line (`$`); in other words, empty lines. + +### More on patterns: `-E`, `-e`, `-F`, `-f` + +Up to now I have not specified what *kind* of regular expression +`grep` uses. By default it uses basic regular expressions, but it +uses extended regular expressions if called with the `-E` option. +Equivalentrly, you can use the command `egrep`. If you want to +turn off regular expressions altogether, you can use `grep -F` (or +`fgrep`). + +If you want to select lines that match *any* of a number of patterns, +you can use the `-e` option: + +``` +$ grep -e PATTERN1 -e PATTERN2 -e ... [file1 file2 ...] +``` + +Alternatively you can write your pattern in a file, one per line, +and use: + +``` +$ grep -f PATTERN_FILE [file1 file2 ...] +``` + +### Grepping multiple files: `-l`, `-n` + +Sometimes I use `grep` to find occurrences of a certain string in +a bunch of files, for example with + +``` +$ grep "word" * +``` + +When used with multiple input files like this, `grep` will precede +each output line with the name of the file that contains it. If the +option `-n` is used, the line number is also shown. If `-l` is used, +only the name of the file is shown, and each file is shown at most +once. + +If you do not want to print the file names at all, you can always +`cat` into `grep`: + +``` +$ cat file1 file2 ... | grep +``` + +But if anyone asks, you did not learn this from me - UUOC (Useless +Use Of Cat) is a considered a crime in some circles. + +### Matching only part of a line: `-o` + +You may not always want the *full line* containing a piece of text. +Sometimes you just want a specific part of a line, and you know +exactly how to match it with a regular expression. In this case you can +use the `-o` option - we'll see an example below. + +The `-o` is not POSIX-standard. It is ubiquitous though, and it +should be present in pretty much any version of `grep`. + +## Examples + +Now that we now the basics, let's see some exciting applications +of `grep`! + +Nah, I am kidding, they are not exciting. But they are useful. Boring, +but useful. + +### Filter command output + +Probably my first use of `grep` was to filter out irrelevant part of +some command's output. Say for example you are troubleshooting a +problem with your webcam: you can use `dmesg` to check what your +operating system knows about it, but most of the output is useless +to your specific problem. No worries, you can pipe `dmesg` into +`grep`: + +``` +$ dmesg | grep video +acpivideo0 at acpi0: VGA_ +acpivout0 at acpivideo0: LCDD +uvideo0 at uhub0 port 6 configuration 1 interface 0 "JMICRON TECHNOLOGIES CO., LTD. USB2.0 UVC VGA WebCam" rev 2.00/2.04 addr 2 +video0 at uvideo0 +``` + +### Look stuff up in files + +Sometimes you may want to search something in a bunch of files. +Let's say for example I want to check in which of my old blog posts +I have mentioned "Linux": + +``` +$ grep -l Linux src/blog/*/* +src/blog/2022-05-29-man/man.md +src/blog/2022-08-14-website/website.md +src/blog/2022-09-10-netbooks/netbooks.md +src/blog/2023-01-28-windows-desktop/windows-desktop.md +src/blog/2023-02-25-job-control/job-control.md +src/blog/2023-02-25-job-control/jobs-diagram.pdf +``` + +Or say I am working on one of my software projects, and I do not remember where +a certain function is defined: + +``` +$ grep -n "^apply_move(" src/*.c +src/moves.c:206:apply_move(Move m, Cube cube) +``` + +*Note: the command above works because, when I write C code, I write +function names on a newline. See also +[this older post](../2022-06-12-shell-ide-sed) for another example +that takes advantage of this, this time using `sed`.* + +### Grepping URLs + +Looking for URLs in a piece of text is a common enough operation +for me that I saved it into a [script](https://git.tronto.net/scripts) +for ease of use, that I called `urlgrep`. URLs can be complicated, +so for a long time I used a regular expression copied from somewhere +on the internet. + +Now now that I am more familiar with `grep` and regular expressions, I have +written my own - it does not work perfectly, but at least I understand it +and I can keep tweaking it if I find errors. + +Let's build it together! What does a URL look like? It usually starts with +either a *protocol* followed by a colon, or with `www.`. Then a bunch of +valid characters follow. There are probably more rules to it, but to keep +is simple we can start like this (using *extended* regular expressions): + +``` +regex="(($protocols):|www\.)[$valid_chars]+" +``` + +For protocols we can use + +``` +protocols='http|https|ftp|sftp|gemini|mailto' +``` + +I have thrown `mailto` in there because it is quite common in links web +pages. The valid characters are: + +``` +valid_chars="][a-zA-Z0-9_~/?#@!$&'()*+=.,;:-" +``` + +(Yes, these ones I actually copied somewhere online). Finally we can +find all URLs with + +``` +$ egrep -o "$regex" +``` + +As I mentioned above there are some problems with this. For example +if a URL is not terminated by a space, the characters following it +may be grepped too. For example: + +``` +$ urlgrep <src/blog/2022-05-21-blogs/blogs.md +https://en.wikipedia.org/wiki/Hypertext). +https://caseymuratori.com/blog_0031) +https://en.wikipedia.org/wiki/Netbook) +https://developer.mozilla.org/en-US/Learn) +https://www.romanzolotarev.com/website.html). +``` + +This is not *technically* a problem, because parentheses and dots are allowed +as part of a URL. But it is *practically* a problem, because most URLs will +only contain matching pairs of parentheses. + +## Conclusion + +`grep` is a must-know for anyone who wants to be proficient with the +UNIX command line. Luckily, it is also pretty easy to learn. + +Moreover, being familiar with `grep` makes it easy to learn more +advanced tools, such as `sed` and `awk`: the "read one line, process +it, print something" idea is common to all three of them. + +Stay tuned for the part 2: `sed`!