sebastiano.tronto.net

Source files and build scripts for my personal website
git clone https://git.tronto.net/sebastiano.tronto.net
Download | Log | Files | Refs | README

commit 7ce6e72ddfa408f6128cf95702b5b761d1ba8518
parent ec88cf30fc7c1b450821b9eaf9c177bc985d4c17
Author: Sebastiano Tronto <sebastiano@tronto.net>
Date:   Wed, 27 Mar 2024 17:50:07 +0100

Added tomorrow's blog post

Diffstat:
Msrc/blog/2024-03-27-rev/rev.md | 2++
Asrc/blog/2024-03-28-cut/cut.md | 138+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Msrc/series/series.md | 2++
3 files changed, 142 insertions(+), 0 deletions(-)

diff --git a/src/blog/2024-03-27-rev/rev.md b/src/blog/2024-03-27-rev/rev.md @@ -18,3 +18,5 @@ Since [text is complicated](https://www.youtube.com/watch?v=gd5uJ7Nlvvo), constitutes a character. And that's it. See you soon for another (longer) post in this series. + +*Next in the series: [cut](../2024-03-28-cut)* diff --git a/src/blog/2024-03-28-cut/cut.md b/src/blog/2024-03-28-cut/cut.md @@ -0,0 +1,138 @@ +# UNIX text filters part 2.4 of 3: cut + +*This post is part of a [series](../../series)* + +Have you ever had to extract a bunch of data from a +[CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file? +CSV is a common file format where multiple values are stored +in each line of a plain text file, separated by a comma or some +other separator. +In most cases it is quite a simple file format to deal with, unless +you want to write a generic parser that has to take into account +all the special cases. But let's say you just want to write a quick +and dirty shell script to read some values out of a single file. +With `cut` you can get the job done pretty quickly! + +## cut + +Getting straight to the point, if you want to print columns 1, 3 +and 4 of each line of `myfile.csv` you can use: + +``` +$ cut -f 1,3,4 -d , myfile.csv +``` + +Let's break this down. + +## Fields, characters and bytes + +The `-f` option tells `cut` that you want to read lines field-by-field, +where fields are are separated by the argument to the `-d` option. +In our example the separator is a comma, but you can use any +character. If unspecified, the separator defaults to a TAB. + +Instead of `-f` you could use `-c` (character) or `-b` (byte). If +you pick one of these, the separator is not to be specified, and +instead of field-by-field the rows are read character-by-character +or byte-by-byte. The difference between a byte and a character +depends on your +[locale](https://en.wikipedia.org/wiki/Locale_(computer_software)), +more specifically on the value of the environment variable `LC_CTYPE`. + +## Picking columns + +Columns are 1-based, hence the argument `1,3,4` gets, surprise +surprise, the first, third and fourth columns of each line. The +order you write the column indices does not matter: if you write +`3,4,1` you still get the columns in the order they appear in the +original file. If you repeat some indices, e.g. `1,3,4,1`, the +repeated column is printed only once. + +You can also use ranges: for example `1,2,5-10` will print the first +column, the second, and all the ones from the fifth to the tenth; +as another example, `-3` will print the first 3 columns - unbounded +ranges are interpreted as "from the start" and "until the end". + +## Examples + +Let see some examples! + +### Simple csv parsing + +Let's say `myfile.csv` is the following: + +``` +2024-01-13,-,4.50,out +2024-02-04,groceries,52.42,out +2024-02-20,reimbursement,89.99,in +2024-03-10,stuff,1.01,out +``` + +Then running the following command command: + +``` +$ cut -f 3,4 -d , myfile.csv +``` + +will result in: + +``` +4.50,out +52.42,out +89.99,in +1.01,out +``` + +### Fixed-width table + +Say you have a table like this in `table.txt`: + +``` +| WCA ID | Type | Result | Days | +--------------------------------------- +| 1982THAI01 | Single | 22.95 | 7749 | +| 2014CZAP01 | Single | 0.49 | 2443 | +| 2011TRON02 | Single | 16 | 1747 | +| 2015GORN01 | Single | 0.91 | 1673 | +| 2015DUYU01 | Single | 3.47 | 1660 | +| 2009ZEMD01 | Single | 6.88 | 1617 | +``` + +and you want to print out only the first and last columns. These +columns are from character 2 to 13 and 33 to 38 respectively, or +1-14 and 32-29 if you include the borders. So you can select them +with the `-b` or `-c` option (they are equivalent in this case) +like this: + +``` +$ cut -c 1-13,32-39 table.txt +``` + +and you will get: + +``` +| WCA ID | Days | +--------------------- +| 1982THAI01 | 7749 | +| 2014CZAP01 | 2443 | +| 2011TRON02 | 1747 | +| 2015GORN01 | 1673 | +| 2015DUYU01 | 1660 | +| 2009ZEMD01 | 1617 | +``` + +Since the ranges start at 1 and end at the last index, the following +command would produce the same result: + +``` +$ cut -c -13,32- table.txt +``` + +## Conclusion + +I have not used `cut` much until today, the main reason being that +the rare times I needed to parse a csv file I usually had to do +something more complicated with the data than just printing it out. +For this reason I have always relied on more complete languages, +like C or Python, rather than shell scripting. But `cut` is definitely +a convenient tool to be familiar with, given how simple it is! diff --git a/src/series/series.md b/src/series/series.md @@ -31,6 +31,8 @@ of complexity: `grep`, `sed` and `awk`. Work in progress. * Part 2: [sed](../blog/2023-12-03-sed) * Part 2.1: [tr](../blog/2024-01-13-tr) * Part 2.2: [head and tail](../blog/2024-02-20-head-and-tail) +* Part 2.3: [rev](../blog/2024-03-27-rev) +* Part 2.4: [cut](../blog/2024-03-28-cut) * Part 3: awk (coming "soon") ## The UNIX shell as an IDE