# UNIX text filters, part 2.4 of 3: cut

*This post is part of a [series](../../series)*

Have you ever had to extract a bunch of data from a
[CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file?
CSV is a common file format where multiple values are stored
in each line of a plain text file, separated by a comma or some
other separator.
In most cases it is quite a simple file format to deal with, unless
you want to write a generic parser that has to take into account
all the special cases. But let's say you just want to write a quick
and dirty shell script to read some values out of a single file.
With `cut` you can get the job done pretty quickly!

## cut

Getting straight to the point, if you want to print columns 1, 3
and 4 of each line of `myfile.csv` you can use:

```
$ cut -f 1,3,4 -d , myfile.csv
```

Let's break this down.

## Fields, characters and bytes

The `-f` option tells `cut` that you want to read lines field-by-field,
where fields are separated by the argument to the `-d` option.
In our example the separator is a comma, but you can use any
single character. If unspecified, the separator defaults to a TAB.

Instead of `-f` you could use `-c` (character) or `-b` (byte). If
you pick one of these, no separator is specified, and
instead of field-by-field the rows are read character-by-character
or byte-by-byte. The difference between a byte and a character
depends on your
[locale](https://en.wikipedia.org/wiki/Locale_(computer_software)),
more specifically on the value of the environment variable `LC_CTYPE`.

## Picking columns

Columns are 1-based, hence the argument `1,3,4` gets, surprise
surprise, the first, third and fourth columns of each line. The
order in which you write the column indices does not matter: if you
write `3,4,1` you still get the columns in the order they appear in the
original file.
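
To see that the order of the list really is ignored, here is a minimal sketch you can paste into a shell. The sample line `a,b,c,d` and the variable names are made up for illustration, they are not part of the files used in this post:

```shell
# Hypothetical one-line input, invented for this illustration.
line='a,b,c,d'

# Same set of fields {1,3,4}, written in two different orders.
sorted=$(printf '%s\n' "$line" | cut -d , -f 1,3,4)
shuffled=$(printf '%s\n' "$line" | cut -d , -f 3,4,1)

# Both lines print the fields in input order.
printf '%s\n%s\n' "$sorted" "$shuffled"
```
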
If you repeat some indices, e.g. `1,3,4,1`, the
repeated column is printed only once.

You can also use ranges: for example `1,2,5-10` will print the first
column, the second, and all the ones from the fifth to the tenth;
as another example, `-3` will print the first 3 columns. Ranges with
a missing bound are interpreted as "from the start" or "until the end".

## Examples

Let's see some examples!

### Simple csv parsing

Let's say `myfile.csv` is the following:

```
2024-01-13,-,4.50,out
2024-02-04,groceries,52.42,out
2024-02-20,reimbursement,89.99,in
2024-03-10,stuff,1.01,out
```

Then running the following command:

```
$ cut -f 3,4 -d , myfile.csv
```

will result in:

```
4.50,out
52.42,out
89.99,in
1.01,out
```

### Fixed-width table

Say you have a table like this in `table.txt`:

```
| WCA ID     | Type   | Result | Days |
---------------------------------------
| 1982THAI01 | Single |  22.95 | 7749 |
| 2014CZAP01 | Single |   0.49 | 2443 |
| 2011TRON02 | Single |     16 | 1747 |
| 2015GORN01 | Single |   0.91 | 1673 |
| 2015DUYU01 | Single |   3.47 | 1660 |
| 2009ZEMD01 | Single |   6.88 | 1617 |
```

and you want to print out only the first and last columns. These
columns are from character 2 to 13 and 33 to 38 respectively, or
1-14 and 32-39 if you include the borders.
So you can select them
with the `-b` or `-c` option (they are equivalent in this case)
like this:

```
$ cut -c 1-13,32-39 table.txt
```

and you will get:

```
| WCA ID     | Days |
---------------------
| 1982THAI01 | 7749 |
| 2014CZAP01 | 2443 |
| 2011TRON02 | 1747 |
| 2015GORN01 | 1673 |
| 2015DUYU01 | 1660 |
| 2009ZEMD01 | 1617 |
```

Since the first range starts at 1 and the second ends at the last
index, the following command would produce the same result:

```
$ cut -c -13,32- table.txt
```

## Conclusion

I had not used `cut` much until today, the main reason being that
the rare times I needed to parse a csv file I usually had to do
something more complicated with the data than just printing it out.
For this reason I have always relied on more complete languages,
like C or Python, rather than shell scripting. But `cut` is definitely
a convenient tool to be familiar with, given how simple it is!

*Next in the series: [expand and unexpand](../2024-04-07-expand-unexpand)*