Page tree

Versions Compared


  • This line was added.
  • This line was removed.
  • Formatting was changed.


  • cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
    • -d <delim> to change the field delimiter (Tab by default)
  • sort sorts its input using an efficient algorithm
    • by default sorts each line lexically, but one or more fields to sort can be specified with one or more -k <field_number> options
    • options to sort numerically (-n), or numbers-inside-text (version sort -V)
    • -t <delim> to change the field delimiter (whitespace -- one or more spaces or Tabs – by default)
  • uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
  • Anchor
    grep -P '<pattern>'
    searches for <pattern> in its input and outputs only lines containing it
    • always enclose <pattern> in single quotes to inhibit shell evaluation!
    • -P says use Perl patterns, which are much more powerful than standard grep patterns
    • -c says just return a count of line matches
    • -n says include the line number of the matching line
    • -v (inverse match) says return only lines not matching the pattern
    • -L says return only the names of files containing no pattern matches
    • -l says return only the names of files that do contain the mattern match
    • <pattern> can contain special match meta-characters and modifiers such as:
      • ^ – matches beginning of line
      • $ – matches end of line
      • .  – (period) matches any single character
      • * – modifier; place after an expression to match 0 or more occurrences
      • + – modifier, place after an expression to match 1 or more occurrences
      • \s – matches any whitespace (\S any non-whitespace)
      • \d – matches digits 0-9
      • \w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
      • \t matches Tab; \r matches Carriage return; \n matches Linefeed
      • [xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
        • this is called a character class.
        • use [^xyz123] to match any single character not listed in the class
      • (Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
        • note that parentheses ( ) may also be used to capture matched sub-expressions for later use
    • Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
      • each "flavor" is slightly different
      • even bash has multiple regex commands: grep, egrep, fgrep.
    • There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
  • Anchor
    '<script>' a powerful scripting language that is easily invoked from the command line
    • <script> is applied to each line of input (generally piped in)
      • always enclose <script> in single quotes to inhibit shell evaluation
    • General structure of an awk script:
      • BEGIN{BEGIN <expressions>}  –  use to initialize variables before any script body lines are executed
        • e.g. BEGIN{BEGIN FS=":"; OFS="\t"; sum=0} says
          • use colon (:) as the input field separator (FS), and tab (\t) as the output field separator (OFS)
            • the default input field separator (FS) is whitespace
              • one or more spaces or tabs
            • the default output field separator (OFS) is a single space
          • initialize the variable sum to 0
      • {<body expressions>}  – expressions to apply to each line of input
        • use $1, $2, etc. to pick out specific input fields
        • e.g. {print $3,$4} outputs fields 3 and 4 of the input, separated by the output field separator.
      • END{END <expressions>} – executed after all input is complete (e.g. print a sum)
    • Here is an excellent awk tutorial, very detailed and in-depth
      •  take a look once you feel comfortable with the example scripts we've gone over in class.