cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
- -d <delim> to change the field delimiter (Tab by default)
sort sorts its input using an efficient algorithm
- by default sorts each line lexically, but one or more fields to sort can be specified with one or more -k <field_number> options
- options to sort numerically (-n), or numbers-inside-text (version sort -V)
- -t <delim> to change the field delimiter (whitespace, i.e. one or more spaces or Tabs, by default)
uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
use cut | sort | uniq -c for a quick-and-dirty histogram (see piping a histogram)
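The cut | sort | uniq -c pipeline above can be sketched as follows. The sample data, its three-column layout and the variable names are made up for illustration. The last two commands also show sort's -k, -n, -r and -t options.

```shell
# Hypothetical sample data: Tab-delimited lines of name, chromosome, score
data=$(printf 'geneA\tchr1\t10\ngeneB\tchr2\t5\ngeneC\tchr1\t7\ngeneD\tchr3\t2\ngeneE\tchr1\t1\n')

# Histogram of field 2: extract it, sort so equal values are adjacent,
# then count each group of identical lines
hist=$(printf '%s\n' "$data" | cut -f 2 | sort | uniq -c)
printf '%s\n' "$hist"

# Order the histogram by count, largest first
# (-k1,1 = sort on field 1 only, n = numeric, r = reverse)
top=$(printf '%s\n' "$hist" | sort -k1,1nr | head -n 1 | awk '{print $2}')
printf 'most common value: %s\n' "$top"

# Sort the original records numerically on field 3, with Tab as the delimiter,
# and report the name (field 1) of the record with the lowest score
lowest=$(printf '%s\n' "$data" | sort -t "$(printf '\t')" -k3,3n | head -n 1 | cut -f 1)
printf 'lowest score: %s\n' "$lowest"
```

Note that uniq -c pads its counts with leading spaces; numeric sort (-n) ignores them.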
grep -P '<pattern>' searches for <pattern> in its input and outputs only lines containing it
always enclose <pattern> in single quotes to inhibit shell evaluation!
-P says use Perl patterns, which are much more powerful than standard grep patterns
-c says just return a count of line matches
-n says include the line number of the matching line
-v (inverse match) says return only lines not matching the pattern
-L says return only the names of files containing no pattern matches
-l says return only the names of files that do contain the pattern match
<pattern> can contain special match meta-characters and modifiers such as:
^ – matches beginning of line
$ – matches end of line
. – (period) matches any single character
* – modifier; place after an expression to match 0 or more occurrences
+ – modifier, place after an expression to match 1 or more occurrences
\s – matches any whitespace (\S any non-whitespace)
\d – matches digits 0-9
\w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
\t – matches Tab; \r – matches Carriage return; \n – matches Linefeed
[xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
this is called a character class.
use [^xyz123] to match any single character not listed in the class
(Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
note that parentheses ( ) may also be used to capture matched sub-expressions for later use
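A few of the options and meta-characters above, sketched on made-up sample text. This assumes a grep that supports -P (GNU grep built with Perl pattern support; some systems' default grep does not). The variable names are illustrative only.

```shell
# Hypothetical sample text: three Tab-delimited records and one comment line
text=$(printf 'chr1\t100\tgeneA\nchr2\t2500\tgeneB\nscaffold_7\t30\torf12\n# comment line\n')

# Lines that start (^) with "chr" followed by one or more digits (\d+)
printf '%s\n' "$text" | grep -P '^chr\d+'

# -c: count lines whose second field is a number with at least 3 digits
n_big=$(printf '%s\n' "$text" | grep -c -P '^\S+\t\d\d\d+\t')

# -v drops lines starting with #; -n prefixes each remaining line with its line number
kept=$(printf '%s\n' "$text" | grep -v -n -P '^#')
printf '%s\n' "$kept"

# Character class: lines ending ($) in geneA or geneB
n_class=$(printf '%s\n' "$text" | grep -c -P 'gene[AB]$')

# Alternation: lines ending in either geneA or "orf" plus digits
n_alt=$(printf '%s\n' "$text" | grep -c -P '(geneA|orf\d+)$')

printf 'big=%s class=%s alt=%s\n' "$n_big" "$n_class" "$n_alt"
```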
Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
each "flavor" is slightly different
even the Linux command line has several grep variants: grep, egrep (extended regular expressions) and fgrep (fixed strings, no regular expressions).
There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
here's a good general one: https://www.regular-expressions.info/
and a perl regex tutorial: http://perldoc.perl.org/perlretut.html
Perl regular expressions are the "gold standard"; the regex syntax of most other languages is modeled on them
awk '<script>' a powerful scripting language that is easily invoked from the command line
<script> is applied to each line of input (generally piped in)
always enclose <script> in single quotes to inhibit shell evaluation
General structure of an awk script:
BEGIN{<expressions>} – use to initialize variables before any script body lines are executed
e.g. BEGIN{FS=":"; OFS="\t"; sum=0} says
use colon (:) as the input field separator (FS), and tab (\t) as the output field separator (OFS)
the default input field separator (FS) is whitespace
one or more spaces or tabs
the default output field separator (OFS) is a single space
initialize the variable sum to 0
{<body expressions>} – expressions to apply to each line of input
use $1, $2, etc. to pick out specific input fields
e.g. {print $3,$4} outputs fields 3 and 4 of the input, separated by the output field separator.
END{<expressions>} – executed after all input is complete (e.g. print a sum)
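Putting the three parts together, a minimal sketch on made-up colon-delimited input (the records and variable names are hypothetical):

```shell
# Hypothetical colon-delimited input: name:group:count
records=$(printf 'alice:staff:3\nbob:admin:5\ncarol:staff:4\n')

# BEGIN sets the field separators and initializes sum before any input is read;
# the body runs once per input line, printing fields 1 and 3 and adding field 3 to sum;
# END runs after the last line and prints the total
out=$(printf '%s\n' "$records" | awk 'BEGIN{FS=":"; OFS="\t"; sum=0}
     {print $1,$3; sum = sum + $3}
     END{print "total",sum}')
printf '%s\n' "$out"
```

Because OFS is set to Tab, the comma in each print statement produces a Tab between the output fields.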
Here is an excellent awk tutorial, very detailed and in-depth
take a look once you feel comfortable with the example scripts we've gone over in class.
