...
- cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
- -d <delim> to change the field delimiter (Tab by default)
- sort sorts its input using an efficient algorithm
- by default sorts each line lexically
- one or more fields to sort can be specified with one or more -k <start_field_number>,<end_field_number> options
- has options to sort numerically (-n), or by numbers embedded in text (version sort, -V)
- -t <delim> to change the field delimiter (whitespace, meaning one or more spaces or Tabs, by default)
- uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
- use cut | sort | uniq -c for a quick-and-dirty histogram (see piping a histogram)
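As a rough illustration of the cut | sort | uniq -c idiom above, here is a minimal sketch. The sample data (gene names and chromosome labels) is made up for the example, and -V assumes a sort that supports version sort, such as the GNU one.

```shell
# Hypothetical Tab-separated sample data: a name column and a category column.
data=$(printf 'geneA\tchr2\ngeneB\tchr1\ngeneC\tchr2\ngeneD\tchr10\ngeneE\tchr2\n')

# Extract field 2, sort so identical values are adjacent, then count each group.
histogram=$(printf '%s\n' "$data" | cut -f 2 | sort | uniq -c)
echo "$histogram"

# Re-sort the histogram on its first field (the count), numerically, largest first.
by_count=$(echo "$histogram" | sort -k1,1nr)
echo "$by_count"

# Version sort orders the number inside the text: chr1, chr2, chr10.
versions=$(printf '%s\n' "$data" | cut -f 2 | sort -V | uniq)
echo "$versions"
```

The plain sort is what makes uniq -c work here, since uniq only merges adjacent identical lines.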
grep -P '<pattern>' searches for <pattern> in its input and outputs only lines containing it
- always enclose <pattern> in single quotes to inhibit shell evaluation!
- -P says use Perl patterns, which are much more powerful than standard grep patterns
- -c says just return a count of line matches
- -n says include the line number of the matching line
- -v (inverse match) says return only lines not matching the pattern
- -L says return only the names of files containing no pattern matches
- -l says return only the names of files that do contain the pattern match
- <pattern> can contain special match meta-characters and modifiers such as:
- ^ – matches beginning of line
- $ – matches end of line
- . – (period) matches any single character
- * – modifier; place after an expression to match 0 or more occurrences
- + – modifier, place after an expression to match 1 or more occurrences
- \s – matches any whitespace (\S any non-whitespace)
- \d – matches digits 0-9
- \w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
- \t – matches Tab
- \r – matches Carriage return
- \n – matches Linefeed
- [xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
- this is called a character class.
- use [^xyz123] to match any single character not listed in the class
- (Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
- note that parentheses ( ) may also be used to capture matched sub-expressions for later use
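To make the options and meta-characters above concrete, here is a small sketch. The sample lines are invented for illustration, and -P assumes a grep built with Perl pattern support (as GNU grep usually is).

```shell
# Hypothetical sample input: three Tab-separated records and a comment line.
lines=$(printf 'sample_01\t42\nsample_02\tNA\ncontrol_A\t7\n# comment line\n')

# ^ anchors the match to the start of the line.
starts=$(printf '%s\n' "$lines" | grep -P '^sample')

# -c counts the lines where a Tab is followed by one or more digits up to the line end.
numeric=$(printf '%s\n' "$lines" | grep -c -P '\t\d+$')

# -v keeps only the lines that do NOT start with #.
no_comments=$(printf '%s\n' "$lines" | grep -v -P '^#')

# -n adds line numbers; (sample|control) matches either word, \w+ the rest of the name.
numbered=$(printf '%s\n' "$lines" | grep -n -P '^(sample|control)_\w+\s')

# A negated character class: lines whose last character is not a digit.
non_digit_end=$(printf '%s\n' "$lines" | grep -c -P '[^0-9]$')

echo "$starts"; echo "$numeric"; echo "$no_comments"; echo "$numbered"; echo "$non_digit_end"
```

Every pattern is in single quotes, so the shell passes characters such as $, \ and | to grep untouched.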
- Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
- each "flavor" is slightly different
- even the Linux command line has several grep variants: grep, egrep (extended patterns) and fgrep (fixed strings, no pattern matching).
- This Wiki page from another CBRS course, Tips and tricks (section "Regular expressions in grep, sed and perl"), has more on using regular expressions on the command line, in grep, sed (stream editor) and perl.
- There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
- here's a great step-by-step intro tutorial: https://regexone.com/
- Ryan's tutorial is also excellent: https://ryanstutorials.net/linuxtutorial/
- and another good general one: https://www.regular-expressions.info/
- and a perl regex tutorial: http://perldoc.perl.org/perlretut.html
- perl regular expressions are the "gold standard" used in most other languages
awk '<script>' is a powerful scripting language that is easily invoked from the command line
- <script> is applied to each line of input (generally piped in)
- always enclose <script> in single quotes to inhibit shell evaluation
- General structure of an awk script:
- BEGIN{<expressions>} – use to initialize variables before any script body lines are executed
- e.g. BEGIN{FS=":"; OFS="\t"; sum=0} says
- use colon (:) as the input field separator (FS), and Tab (\t) as the output field separator (OFS)
- the default input field separator (FS) is whitespace
- one or more spaces or tabs
- the default output field separator (OFS) is a single space
- initialize the variable sum to 0
- {<body expressions>} – expressions to apply to each line of input
- use $1, $2, etc. to pick out specific input fields
- e.g. {print $3,$4} outputs fields 3 and 4 of the input, separated by the output field separator.
- END{<expressions>} – executed after all input is complete (e.g. print a sum)
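The BEGIN / body / END structure described above can be sketched as follows. The colon-delimited records (name, group, count) are hypothetical sample data, not from the course files.

```shell
# Hypothetical colon-delimited input: name, group, count.
records=$(printf 'alice:lab1:10\nbob:lab2:25\ncarol:lab1:7\n')

result=$(printf '%s\n' "$records" | awk '
BEGIN { FS=":"; OFS="\t"; sum=0 }   # runs once, before any input is read
{ print $1, $3; sum = sum + $3 }    # runs for every input line
END { print "total", sum }          # runs once, after the last line
')
echo "$result"
```

Because OFS is set to Tab, the comma in each print statement produces a Tab between the output fields, and the END block reports the running sum (42 for this sample).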
- Here is an excellent awk tutorial, very detailed and in-depth
- take a look once you feel comfortable with the example scripts we've gone over in class.
...