sebastiano.tronto.net

Source files and build scripts for my personal website
git clone https://git.tronto.net/sebastiano.tronto.net

commit 555bbbac1a475639b21a8ceebdd3cf0a06c742d3
parent 1cddcc1dc5c24a87ca464eceddcf35745ed89c7c
Author: Sebastiano Tronto <sebastiano@tronto.net>
Date:   Sat, 29 Jul 2023 23:09:08 +0200

Merge branch 'master' of tronto.net:sebastiano.tronto.net

Diffstat:
A src/blog/2023-06-16-regex/regex.md | 201+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A src/blog/2023-07-11-feed/feed.md | 340+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/blog/blog.md | 2++
M src/blog/feed.xml | 14++++++++++++++
4 files changed, 557 insertions(+), 0 deletions(-)

diff --git a/src/blog/2023-06-16-regex/regex.md b/src/blog/2023-06-16-regex/regex.md @@ -0,0 +1,201 @@ +# UNIX text filters, part 0 of 3: regular expressions + +One of the most important features of UNIX and its descendants, if +not *the* most important feature, is input / output redirection: +the output of a command can be displayed to the user, written to a +file or used as the input for another command seamlessly, without +the program knowing which of these things is happening. This +is possible because most UNIX programs use *plain text* as their +input/output language, which is understood equally well by the three +types of users - humans, files and other running programs. + +Since this is such a fundamental feature of UNIX, I thought it would +be nice to go through some of the standard tools that help the user +take advantage of it. At first I thought of doing this as part of +my *man page reading club* series, but in the end I decided to give +them their own space. My other series has also been going on for +more than a year now, so it is a good time to end it and start a +new one. + +Let me then introduce you to: **UNIX text filters**. + +## Text filters + +For the purpose of this blog series, a *text filter* is a program +that reads plain text from standard input and writes a modified, +or *filtered*, version of the same text to standard output. According +to the introductory paragraph, this definition includes most UNIX +programs; but we are going to focus on the following three, in +increasing order of complexity: + +* grep +* sed +* awk + +In order to unleash the true power of these tools, we first need +to grasp the basics of +[regular expressions](https://en.wikipedia.org/wiki/Regular_expression). +And what better way to do it than following the dedicated +[OpenBSD manual page](https://man.openbsd.org/OpenBSD-7.3/re_format)? 
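As a tiny illustration of this idea (my own example, not from the post above): the same text-filter pipeline works identically whether its output goes to the terminal, to a file, or into another program:

```shell
# sort some lines and drop adjacent duplicates; neither sort nor
# uniq knows whether its input comes from a user, a file or
# another running program
printf 'pear\napple\npear\n' | sort | uniq
# prints: apple, pear

# redirect the very same output to a file instead of the terminal
printf 'pear\napple\npear\n' | sort | uniq > fruits.txt
```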
+ +## (Extended) regular expressions + +Regular expressions, or regexes for short, are a convenient way to +describe text patterns. They are commonly used to solve generic +string matching problems, such as determining if a given piece +of text is a valid URL. Many standard UNIX tools, including the three +we are going to cover in this series, support regexes. + +Let's deal with the nasty part first: even within POSIX, there is +not one single standard for regular expressions; there are at least +two of them: Basic Regular Expressions (BREs) and Extended Regular +Expressions (EREs). As always happens when there is more than one +standard for the same thing, other people decided to come up with +another version to replace all previous "standards", so we also have +[PCREs](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions), +and probably more. [Things got out of hand quickly](https://xkcd.com/927). + +In this post I am going to follow the structure of +[re_format(7)](https://man.openbsd.org/OpenBSD-7.3/re_format) and +present *extended* regular expressions first. After that I'll point +out the differences with *basic* regular expressions. + +The goal is not to provide a complete guide to regexes, but rather +an introduction to the most important features, glossing over the +nasty edge cases. Keep also in mind that I am in no way an expert +on the subject: we are learning together, here! + +### The basics + +You can think of a regular expression as a *pattern*, or a *rule*, +that describes which strings are "valid" (they are *matched* by the +regular expression) and which are not. As a trivial example, the +regular expression `hello` matches only the string "hello". A less +trivial example is the regex `.*`, which matches *any* string. I'll +explain why in a second. + +Beware not to confuse regular expressions with *shell globs*, i.e. +the rules for shell command expansion. 
Although they use similar +symbols to achieve a similar goal, they are not the same thing. See +[my post on sh(1)](../2022-09-13-sh-1) or +[glob(7)](https://man.openbsd.org/OpenBSD-7.3/glob.7) for an +explanation of shell globs. + +### General structure and terminology + +A general regex looks something like this: + +``` +piece piece piece ... | piece piece piece ... | ... +``` + +A sequence of *pieces* is called a *branch*, and a regex is a +sequence of branches separated by pipes `|`. Pieces are not separated +by spaces; they are simply concatenated. + +The pipes `|` are read "or": a regex matches a given string if any +of its branches does. A branch matches a given string if the latter +can be written as a sequence of strings, each matching one of the +pieces, in the given order. + +Before going into what pieces are exactly, consider the following +example: + +``` +hello|world +``` + +This regex matches both the string "hello" and the string "world", +and nothing else. The pieces are the single letters composing the +two words, and as you can see they are juxtaposed without spaces. + +But what else is a valid piece? In general, a piece is made up of +an *atom*, optionally followed by a *multiplier*. + +### Atoms + +As we have already seen, the simplest kind of atom is a single +character. The most *general* kind of atom, on the other hand, is +a whole regular expression enclosed in parentheses `()`. Yes, regexes +are recursive. + +There are some special characters: for example, a single dot `.` +matches *any* single character. The characters `^` and `$` match +an empty string at the beginning and at the end of a line, respectively. +If you want to match a special character as if it were regular, say +because you want to match strings that represent values in the +dollar currency, you can *escape* it with a backslash. For example, +`\$` matches the string "$". 
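These special atoms are easy to try out interactively. A quick sketch using `grep -E` - a tool we will only cover later in this series, whose `-E` flag selects extended regexes - which prints the input lines matched by a pattern:

```shell
# . matches any single character: c.t matches "cat" and "cut",
# but nothing in "cart" (the c and the t are not one apart)
printf 'cat\ncut\ncart\n' | grep -E 'c.t'
# prints: cat, cut

# ^ and $ anchor the match to the beginning and end of the line
printf 'hat\nchat\n' | grep -E '^hat$'
# prints: hat  (without the anchors, "chat" would match too)
```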
+ +The last kind of atom is the *bracket expression*: a list of +characters enclosed in brackets `[]`. A simple list of +characters in brackets, like `[xyz]`, matches any character in the +list, unless the first character is a `^`, in which case it matches +every character *not* in the list. Two characters separated by a +dash `-` denote a range: for example `[a-z]` matches every lowercase +letter and `[1-7]` matches all digits from 1 to 7. + +You can also use certain special sets of characters inside brackets, +like `[[:lower:]]` to match every lowercase letter (same as `[a-z]`), +`[[:alnum:]]` to match every alphanumeric character or `[[:digit:]]` +to match every decimal digit. Check the +[man page](https://man.openbsd.org/OpenBSD-7.3/re_format) +for the full list. + +### Multipliers + +The term "multiplier" does not appear anywhere in the manual page; I +made it up. But I think it fits, so I'll keep using it. + +Multipliers allow you to match an atom repeated a specified or +unspecified number of times. The most general one is the *bound* +multiplier, which consists of one or two comma-separated numbers +enclosed in braces `{}`. + +In its simplest form, the multiplier `{n}` repeats the multiplied +atom `n` times. For example, the regex `a{7}` is equivalent to the +regex `aaaaaaa` (and it matches the string "aaaaaaa"). + +The form `{n,m}` matches *any number* between `n` and `m` of copies +of the preceding atom. For example `a{2,4}` is equivalent to +`aa|aaa|aaaa`. If the integer `m` is not specified, the multiplied +atom matches any string that consists of *at least* `n` copies of +the atom. + +Now we can explain very quickly the more common multipliers `+`, +`*` and `?`: they are equivalent to `{1,}`, `{0,}` and `{0,1}` +respectively. That is to say, `+` matches at least one copy of the +atom, `*` matches any number of copies (including none) and `?` +matches either one copy or none. 
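To see branches, bracket expressions and multipliers in action, here is another small sketch with `grep -E` (again, anticipating a tool from later in this series):

```shell
# two branches, each made of single-character atoms
printf 'hello\nworld\ngoodbye\n' | grep -E 'hello|world'
# prints: hello, world

# a bracket expression, the + multiplier and an escaped $
printf 'price: 10$\nno dollars here\n' | grep -E '[0-9]+\$'
# prints: price: 10$

# a bound multiplier: between two and four consecutive a's
printf 'a\naaa\n' | grep -E 'a{2,4}'
# prints: aaa
```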
+ +## Basic regular expressions + +Basic regular expressions are less powerful than their extended +counterparts (with one exception, see below) and require more +backslashes, but it is worth knowing them, because they are used +by default in some programs (for example [ed(1)](../2022-12-24-ed)). +The main differences between EREs and BREs are: + +* BREs consist of one single branch, i.e. there is no `|`. +* The multipliers `+` and `?` do not exist. +* You need to escape parentheses `\(\)` and braces `\{\}` to + use them with their special meaning. + +There is one feature of BREs, called *back-references*, that is +absent in EREs. Apparently it makes the implementation much more +complex, and it makes BREs more powerful. I noticed that the author of +the manual page despises back-references, so, out of respect, I am +not going to learn them. + +## Conclusion + +Regexes are a powerful tool, and they are more than worth knowing. +But, quoting from the manual page: + +``` + Having two kinds of REs is a botch. +``` + +I hope you enjoyed this post, despite the scarcity of practical examples. +If you want to see more applications of regular expressions, stay +tuned for the next entries on grep, sed and awk! diff --git a/src/blog/2023-07-11-feed/feed.md b/src/blog/2023-07-11-feed/feed.md @@ -0,0 +1,340 @@ +# My minimalistic RSS feed setup + +A couple of years ago I started using +[RSS](https://en.wikipedia.org/wiki/Rss) +(or [atom](https://en.wikipedia.org/wiki/Atom_(standard))) +feeds to stay up to date with websites and blogs I wanted to read. +This method is more convenient than what I used before (i.e. opening +Firefox and then opening each website I wanted to follow in a new +tab, one by one), but unfortunately not every website provides an +RSS feed these days. + +At first I used [newsboat](https://newsboat.org), but I soon started +disliking the curses interface - see also my rant on curses at the +end of [this other blog post](../2022-12-24-ed). Then I discovered +`sfeed`. 
+ +## sfeed + +[`sfeed`](https://codemadness.org/sfeed-simple-feed-parser.html) +is an extremely minimalistic RSS and atom reader: it reads +the XML content of a feed file from standard input and outputs one line per +feed item, with tab-separated timestamp, title, link and so on. This tool +comes bundled with other commands that can be combined with it, such as +`sfeed_plain`, which converts the output of sfeed into something +more readable: + +``` +$ curl -L https://sebastiano.tronto.net/blog/feed.xml | sfeed | sfeed_plain + 2023-06-16 02:00 UNIX text filters, part 0 of 3: regular expressions https://sebastiano.tronto.net/blog/2023-06-16-regex + 2023-05-05 02:00 I had to debug C code on a smartphone https://sebastiano.tronto.net/blog/2023-05-05-debug-smartphone + 2023-04-10 02:00 The big rewrite https://sebastiano.tronto.net/blog/2023-04-10-the-big-rewrite + 2023-03-30 02:00 The man page reading club: dc(1) https://sebastiano.tronto.net/blog/2023-03-30-dc + 2023-03-06 01:00 Resizing my website's pictures with ImageMagick and find(1) https://sebastiano.tronto.net/blog/2023-03-06-resize-pictures +... +``` + +One can also write a configuration file with all the desired feeds +and fetch them with `sfeed_update`, or even use the `sfeed_curses` +UI. But the reason I tried out `sfeed` in the first place is that +I *did not* want to use a curses UI, so I decided to stick with +`sfeed_plain`. + +## My wrapper script - old versions + +On the project's homepage, the following short script is presented to +demonstrate the flexibility of sfeed: + +``` +#!/bin/sh +url=$(sfeed_plain "$HOME/.sfeed/feeds/"* | dmenu -l 35 -i | \ + sed -n 's@^.* \([a-zA-Z]*://\)\(.*\)$@\1\2@p') +test -n "${url}" && $BROWSER "${url}" +``` + +The first line shows a list of feed items in +[dmenu](https://tools.suckless.org/dmenu) +to let the user select one; the second line opens the selected item +in a web browser. 
I was impressed by how simple and clever this +example was, and I decided to expand on it to build "my own" feed +reader UI. + +In the first version I made, my feeds were organized in folders, +one feed per file, and one could select multiple feeds or even entire +folders via dmenu, using +[dmenu-filepicker](https://git.tronto.net/scripts/file/dmenu-filepicker.html) +for file selection. +Once the session was terminated, all shown feeds were marked as +"read" by writing the timestamp of the last read item to a cache +file, and they were not shown again on successive calls. + +This system worked fine for me, but at some point I grew tired of +feeds being marked as "read" automatically. I also disliked the +complexity of my own script. So I rewrote it from scratch, giving +up the idea of marking feeds as read. This second version can still +be found in the *old* folder of my +[scripts repo](https://git.tronto.net/scripts), but I may remove it +in the future. You will still be able to find it in the git history. + +I have happily used this second version for more than a year, but +I had some minor issues with it. The main one was that, as I started +adding more and more websites to my feed list, fetching them took +longer and longer - up to 20-30 seconds. While the feed was loading, +I could not start doing other stuff, because dmenu would later +grab my keyboard while I was typing. Moreover, having a way to +filter out old feed items is kinda useful when you check your feed +relatively often. A few weeks ago I had enough and I decided to +rewrite my wrapper script once again. + +## My wrapper script - current version + +In its current version, my `feed` script accepts four sub-commands: +`get` to update the feed, `menu` to prompt a dmenu selection, `clear` +to remove the old items and `show` to list all the new items. +Since `clear` is a separate action, I do not have the problem I +used to have with my first version, i.e. 
that feeds are automatically +marked as read even if I sometimes do not want them to be. + +Let's walk through my last iteration on this script - you can find +it in my scripts repository, but I'll include it at the end of this +section too. + +At first I define some variables (mostly filenames), so that I can +easily adapt the script if one day I want to move stuff around: + +``` +dir=$HOME/box/sfeed +feeddir=$dir/urls +destdir=$dir/new +olddir=$dir/old +readdir=$dir/last +menu="dmenu -l 20 -i" +urlopener=open-url +``` + +Here `open-url` is another one of my utility scripts. + +To update the feed, I loop over the files in my feed folder. Each +file contains a single line with the feed's url, and the name of +the file is the name / title of the website. The results of `sfeed` +are piped into `sfeed_plain` and then saved to a file, and the most +recent time stamp for each feed is updated. + +``` +getnew() { + for f in "$feeddir"/*; do + read -r url < "$f" + name=$(basename "$f") + d="$destdir/$name" + r="$readdir/$name" + + [ -f "$r" ] && read -r lr < "$r" || lr=0 + + # Get new feed items + tmp=$(mktemp) + curl -s "$url" | sfeed | \ + awk -v lr="$lr" '$1 > lr {print $0}' | \ + tee "$tmp" | sfeed_plain >> "$d" + + # Update last time stamp + awk -v lr="$lr" '$1 > lr {lr=$1} END {print lr}' <"$tmp" >"$r" + done +} +``` + +The next snippet is used to show the new feed items. +The `for` loop could be replaced by a simple +`cat "$destdir"/*`, but I also want to prepend each line with +the name of the website. 
+ +``` +show() { + for f in "$destdir"/*; do + ff=$(basename "$f") + if [ -s "$f" ]; then + while read -r line; do + printf '%20s %s\n' "$ff" "$line" + done < "$f" + fi + done +} +``` + +Finally, the following one-liner can be used to prompt the user to +select and open the desired items in a browser using dmenu: + +``` +selectmenu() { + $menu | awk '{print $NF}' | xargs $urlopener +} +``` + +The "clear" action is a straightforward file management routine, +and the rest of the script is just shell boilerplate code to parse +the command line options and sub-commands. Putting it all together, +the script looks like this: + +``` +#!/bin/sh + +# RSS feed manager + +# Requires: sfeed, sfeed_plain (get), dmenu, open-url (menu) + +# Usage: feed [-m menu] [get|menu|clear|show] + +dir=$HOME/box/sfeed +feeddir=$dir/urls +destdir=$dir/new +olddir=$dir/old +readdir=$dir/last +menu="dmenu -l 20 -i" +urlopener=open-url + +usage() { + echo "Usage: feed [-m menu] [get|menu|clear|show]" +} + +getnew() { + for f in "$feeddir"/*; do + read -r url < "$f" + name=$(basename "$f") + d="$destdir/$name" + r="$readdir/$name" + + [ -f "$r" ] && read -r lr < "$r" || lr=0 + + # Get new feed items + tmp=$(mktemp) + curl -s "$url" | sfeed | \ + awk -v lr="$lr" '$1 > lr {print $0}' | \ + tee "$tmp" | sfeed_plain >> "$d" + + # Update last time stamp + awk -v lr="$lr" '$1 > lr {lr=$1} END {print lr}' <"$tmp" >"$r" + done +} + +show() { + for f in "$destdir"/*; do + ff=$(basename "$f") + if [ -s "$f" ]; then + while read -r line; do + printf '%20s %s\n' "$ff" "$line" + done < "$f" + fi + done +} + +selectmenu() { + $menu | awk '{print $NF}' | xargs $urlopener +} + +while getopts "m:" opt; do + case "$opt" in + m) + menu="$OPTARG" + ;; + *) + usage + exit 1 + ;; + esac +done + +shift $((OPTIND - 1)) + +if [ -z "$1" ]; then + usage + exit 1 +fi + +case "$1" in + get) + getnew + countnew=$(cat "$destdir"/* | wc -l) + echo "$countnew new feed items" + ;; + menu) + show | selectmenu + ;; + clear) + 
d="$olddir/$(date +'%Y-%m-%d-%H-%M-%S')" + mkdir "$d" + mv "$destdir"/* "$d/" + ;; + show) + show + ;; + *) + usage + exit 1 + ;; +esac +``` + +I personally like this approach of taking a simple program that +only uses standard output and standard input and wrapping it in +a shell script to have it do exactly what I want. The bulk of the +work is done by the "black box" program, and the shell script glues +it together with the "configuration" files (in this case, my feed +folder) and presents the results to me, interactively (e.g. via +dmenu) or otherwise. + +At this point my feed-consumption workflow looks something like +this: first I run `feed get`, then I do other stuff while the feed loads +and later, after a couple of minutes or so, I run `feed show` or +`feed menu`. This is still not ideal, because whenever I want to +check my feeds I still have to wait for them to be downloaded. The +only way around this would be to have `feed get` run automatically +when I am not thinking about it... + +## Setting up a cron job + +My personal laptop is not always connected to the internet, and in +general I do not like having too many network-related jobs running +in the background. But I do have a machine that is always connected +to the internet: the VM instance hosting this website. + +Since my new setup saves my feed updates to local files, I can have +a [cron job](https://en.wikipedia.org/wiki/Cron_job) fetch the new +items and update files in a folder sync'd via +[syncthing](https://syncthing.net) (yes, I do have that *one* network +service constantly running in the background...). This setup is +similar to the one I use to [fetch my email](../2022-10-19-email-setup). + +I rarely use cron, and I am always a little intimidated by its +syntax. But in the end, to have `feed get` run every hour I just +needed to add the following two lines via `crontab -e`: + +``` +MAILTO="" +0 * * * * feed get +``` + +This is my definitive new setup, and I like it. 
It also has the +advantage that I only need to install `sfeed` on my server and not +locally, though I still prefer to keep it around. + +So far I have found one little caveat: if my feed gets updated after +I read it and before I run a `feed clear`, some items may be deleted +before I see them. This is easily worked around by running a quick +`feed show` before I clear the feeds up, but it is still worth +keeping in mind. + +## Conclusions + +This is a summary of my latest script-crafting adventure. As I was +writing this post I realized I could probably use `sfeed_update` +to simplify the script a bit, since I do not separate feeds into +folders anymore. I have also found out that `sfeed_mbox` was created +(at least I *think* it was not there the last time I checked) and I +could use it to browse my feed with a mail client - see also +[this video tutorial](https://josephchoe.com/rss-terminal) for a demo. + +With all of this, did I solve my problem in the best possible way? +Definitely not. But does it work for me? Absolutely! Did I learn +something new while doing this? Kind of, but mostly I have just +exercised skills that I already had. + +All in all, it was a fun exercise. 
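For reference, the `sfeed_update` route mentioned in the conclusions is configured with an `sfeedrc` file, which is itself a small shell script. This is only a sketch from my reading of the sfeed documentation - the path and the feed names/URLs are examples, check sfeedrc(5) for the real format:

```shell
# ~/.sfeed/sfeedrc -- configuration for sfeed_update (a sketch;
# feed names and URLs below are just examples)

# directory where the fetched feed files are stored
sfeedpath="$HOME/.sfeed/feeds"

# sfeed_update calls feeds(); each entry is: feed <name> <feedurl>
feeds() {
	feed "tronto.net" "https://sebastiano.tronto.net/blog/feed.xml"
	feed "codemadness" "https://codemadness.org/atom.xml"
}
```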
diff --git a/src/blog/blog.md b/src/blog/blog.md @@ -5,6 +5,8 @@ ## 2023 +* 2023-07-11 [My minimalistic RSS feed setup](2023-07-11-feed) +* 2023-06-16 [UNIX text filters, part 0 of 3: regular expressions](2023-06-16-regex) * 2023-05-05 [I had to debug C code on a smartphone](2023-05-05-debug-smartphone) * 2023-04-10 [The big rewrite](2023-04-10-the-big-rewrite) * 2023-03-30 [The man page reading club: dc(1)](2023-03-30-dc) diff --git a/src/blog/feed.xml b/src/blog/feed.xml @@ -9,6 +9,20 @@ Thoughts about software, computers and whatever I feel like sharing </description> <item> +<title>My minimalistic RSS feed setup</title> +<link>https://sebastiano.tronto.net/blog/2023-07-11-feed</link> +<description>My minimalistic RSS feed setup</description> +<pubDate>2023-07-11</pubDate> +</item> + +<item> +<title>UNIX text filters, part 0 of 3: regular expressions</title> +<link>https://sebastiano.tronto.net/blog/2023-06-16-regex</link> +<description>UNIX text filters, part 0 of 3: regular expressions</description> +<pubDate>2023-06-16</pubDate> +</item> + +<item> <title>I had to debug C code on a smartphone</title> <link>https://sebastiano.tronto.net/blog/2023-05-05-debug-smartphone</link> <description>I had to debug C code on a smartphone</description>