cut.md - sebastiano.tronto.net - Source files and build scripts for my personal website

      1 # UNIX text filters, part 2.4 of 3: cut
      2 
      3 *This post is part of a [series](../../series)*
      4 
      5 Have you ever had to extract a bunch of data from a
      6 [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file?
      7 CSV is a common file format where multiple values are stored
      8 in each line of a plain text file, separated by a comma or some
      9 other separator.
     10 In most cases it is quite a simple file format to deal with, unless
     11 you want to write a generic parser that has to take into account
     12 all the special cases. But let's say you just want to write a quick
     13 and dirty shell script to read some values out of a single file.
     14 With `cut` you can get the job done pretty quickly!
     15 
     16 ## cut
     17 
     18 Getting straight to the point, if you want to print columns 1, 3
     19 and 4 of each line of `myfile.csv` you can use:
     20 
     21 ```
     22 $ cut -f 1,3,4 -d , myfile.csv
     23 ```
     24 
     25 Let's break this down.
     26 
     27 ## Fields, characters and bytes
     28 
     29 The `-f` option tells `cut` that you want to read lines field-by-field,
     30 where fields are separated by the argument to the `-d` option.
     31 In our example the separator is a comma, but you can use any
     32 character.  If unspecified, the separator defaults to a TAB.
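
As a quick illustration of the separator rules, here is a small sketch using made-up input (`alpha` and `beta` are just placeholder values). It also shows one behavior worth knowing: with `-f`, a line that does not contain the separator at all is printed unchanged.

```sh
# Two fields separated by a TAB: no -d needed.
printf 'alpha\tbeta\n' | cut -f 2         # prints: beta

# The same data separated by a colon: -d must name the separator.
printf 'alpha:beta\n' | cut -d : -f 2     # prints: beta

# No TAB in this line, so cut finds no fields and prints the whole line.
printf 'alpha:beta\n' | cut -f 2          # prints: alpha:beta
```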
     33 
     34 Instead of `-f` you could use `-c` (character) or `-b` (byte). If
     35 you pick one of these, the `-d` option does not apply, and
     36 instead of field-by-field the lines are read character-by-character
     37 or byte-by-byte. The difference between a byte and a character
     38 depends on your
     39 [locale](https://en.wikipedia.org/wiki/Locale_(computer_software)),
     40 more specifically on the value of the environment variable `LC_CTYPE`.
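
To make the byte/character distinction concrete, here is a minimal sketch with the letter "è", which takes two bytes in UTF-8. The bytes are written as octal escapes so the example does not depend on how this file is encoded. Only `-b` is shown, because what `-c` does with multi-byte characters depends on your locale and on your `cut` implementation (some versions of GNU `cut` have treated `-c` exactly like `-b`).

```sh
# "è" in UTF-8 is the two bytes 0xC3 0xA8 (octal 303 250).
printf '\303\250bc\n' | wc -c         # 5 bytes: two for è, then b, c, newline
printf '\303\250bc\n' | cut -b 1-2    # prints: è (on a UTF-8 terminal)
printf '\303\250bc\n' | cut -b 3      # prints: b
```

Selecting only byte 1 would split the "è" in half and print an invalid sequence, which is the kind of surprise `-c` is meant to avoid.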
     41 
     42 ## Picking columns
     43 
     44 Columns are 1-based, hence the argument `1,3,4` gets, surprise
     45 surprise, the first, third and fourth columns of each line. The
     46 order you write the column indices does not matter: if you write
     47 `3,4,1` you still get the columns in the order they appear in the
     48 original file. If you repeat some indices, e.g. `1,3,4,1`, the
     49 repeated column is printed only once.
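
You can check the ordering and repetition rules on a one-line input. The letters `a,b,c,d` are just placeholder data for this sketch:

```sh
printf 'a,b,c,d\n' | cut -d , -f 1,3,4      # prints: a,c,d
printf 'a,b,c,d\n' | cut -d , -f 3,4,1      # prints: a,c,d (same order as the input)
printf 'a,b,c,d\n' | cut -d , -f 1,3,4,1    # prints: a,c,d (column 1 only once)
```

So `cut` alone cannot reorder or duplicate columns; for that you need a different tool.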
     50 
     51 You can also use ranges: for example `1,2,5-10` will print the first
     52 column, the second, and all the ones from the fifth to the tenth;
     53 as another example, `-3` will print the first 3 columns and `3-` all
     54 columns from the third on: a missing bound means "from the start" or "until the end".
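
Here is a short sketch of ranges on a made-up six-column line. Note that a range reaching past the last column is not an error: `cut` simply prints the columns that exist.

```sh
printf 'a,b,c,d,e,f\n' | cut -d , -f -3          # prints: a,b,c
printf 'a,b,c,d,e,f\n' | cut -d , -f 5-          # prints: e,f
printf 'a,b,c,d,e,f\n' | cut -d , -f 1,2,5-10    # prints: a,b,e,f
```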
     55 
     56 ## Examples
     57 
     58 Let's see some examples!
     59 
     60 ### Simple csv parsing
     61 
     62 Let's say `myfile.csv` is the following:
     63 
     64 ```
     65 2024-01-13,-,4.50,out
     66 2024-02-04,groceries,52.42,out
     67 2024-02-20,reimbursement,89.99,in
     68 2024-03-10,stuff,1.01,out
     69 ```
     70 
     71 Then running the following command:
     72 
     73 ```
     74 $ cut -f 3,4 -d , myfile.csv
     75 ```
     76 
     77 will result in:
     78 
     79 ```
     80 4.50,out
     81 52.42,out
     82 89.99,in
     83 1.01,out
     84 ```
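
Since `cut` reads standard input when no file is given, it combines well with other filters. As a sketch of the "quick and dirty script" idea, the following recreates the sample file and prints only the amounts of the outgoing rows. The use of `grep` for the row selection is my choice for this example, and it assumes the last field is always exactly `in` or `out`.

```sh
# Recreate the sample file from this section.
cat > myfile.csv <<'EOF'
2024-01-13,-,4.50,out
2024-02-04,groceries,52.42,out
2024-02-20,reimbursement,89.99,in
2024-03-10,stuff,1.01,out
EOF

# Keep only the rows ending in ",out", then print their third field.
grep ',out$' myfile.csv | cut -d , -f 3
# prints:
# 4.50
# 52.42
# 1.01
```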
     85 
     86 ### Fixed-width table
     87 
     88 Say you have a table like this in `table.txt`:
     89 
     90 ```
     91 |   WCA ID   |  Type  | Result | Days |
     92 ---------------------------------------
     93 | 1982THAI01 | Single |  22.95 | 7749 |
     94 | 2014CZAP01 | Single |   0.49 | 2443 |
     95 | 2011TRON02 | Single |     16 | 1747 |
     96 | 2015GORN01 | Single |   0.91 | 1673 |
     97 | 2015DUYU01 | Single |   3.47 | 1660 |
     98 | 2009ZEMD01 | Single |   6.88 | 1617 |
     99 ```
    100 
    101 and you want to print out only the first and last columns. These
    102 columns are from character 2 to 13 and 33 to 38 respectively, or
    103 1-14 and 32-39 if you include the borders. Dropping character 14
    104 avoids a doubled border, so you can select them with the `-b` or
    105 `-c` option (equivalent here, as the table is plain ASCII) like this:
    106 
    107 ```
    108 $ cut -c 1-13,32-39 table.txt
    109 ```
    110 
    111 and you will get:
    112 
    113 ```
    114 |   WCA ID   | Days |
    115 ---------------------
    116 | 1982THAI01 | 7749 |
    117 | 2014CZAP01 | 2443 |
    118 | 2011TRON02 | 1747 |
    119 | 2015GORN01 | 1673 |
    120 | 2015DUYU01 | 1660 |
    121 | 2009ZEMD01 | 1617 |
    122 ```
    123 
    124 Since the first range starts at character 1 and the second ends at
    125 the last character of each line, this command gives the same result:
    126 
    127 ```
    128 $ cut -c -13,32- table.txt
    129 ```
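
As a variation on the same idea, you can narrow the ranges to drop the borders and padding and keep only the values. This is a sketch on a single row of the table; the positions 3-12 and 33-37 come from counting characters in that row, so they only work as long as the table keeps this exact layout.

```sh
# Characters 3-12 hold the WCA ID; 33-37 hold a space followed by the days.
printf '| 1982THAI01 | Single |  22.95 | 7749 |\n' | cut -c 3-12,33-37
# prints: 1982THAI01 7749
```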
    130 
    131 ## Conclusion
    132 
    133 I have not used `cut` much until today, the main reason being that
    134 the rare times I needed to parse a csv file I usually had to do
    135 something more complicated with the data than just printing it out.
    136 For this reason I have always relied on more complete languages,
    137 like C or Python, rather than shell scripting. But `cut` is definitely
    138 a convenient tool to be familiar with, given how simple it is!
    139 
    140 *Next in the series: [expand and unexpand](../2024-04-07-expand-unexpand)*