(mini) Unix for Poets

from Unix for Poets
by Kenneth Ward Church
AT&T Bell Laboratories
kwc @research att com

Disclaimer: this text was OCR'd and more-or-less HTML-ized by Paai and Diwi, and has since been corrected by G. Rigau. Please note that there are still many errors, especially where shell syntax is used. Correcting them is left as an exercise for the students.
We will add comments and reflections to Church's original sheets. Such comments will be easily recognizable.

Exercises to be addressed

  • See a file
  • Count words in a text
  • Sort a list of words in various ways
  • Extract useful info from a dictionary
  • Compute ngram statistics

  • Tools

    Please check the man-pages of the commands you are using and try to recognize the options that are used in the examples!

    Uncompress and see

    Type the following commands:
    file bible.txt.gz
    gunzip -c bible.txt.gz | more
    zmore bible.txt.gz
    gunzip -c bible.txt.gz | less
    gunzip -c bible.txt.gz | tail
    gunzip -c bible.txt.gz | head
    gunzip -c bible.txt.gz | wc
    gunzip bible.txt

    Exercise 1: Count words in a text

    Algorithm

    1. Tokenize (tr)
    2. Sort (sort)
    3. Count duplicates (uniq -c)

    Solution to Exercise 1
    tr -sc 'A-Za-z' '\012' < bible.txt | sort | uniq -c | more

    1
     7973 a
    236 A
    1 aa
    350 Aaron
    2 Aaronites
    1 Abaddon
    1 Abagtha
    1 Abana
    4 Abarim
    ...
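As a quick check of the pipeline on something small enough to verify by eye, the sketch below runs the same three steps on a one-line sample sentence (the sentence is made up for illustration). Setting LC_ALL=C is an added assumption that makes the sort order reproducible, and the final awk step only strips the padding that uniq -c puts in front of each count.

```shell
# Sample text invented for this illustration; not taken from bible.txt.
sample='the cat and the dog and the bird'

# 1. tokenize: every run of non-letters becomes a single newline
# 2. sort:     bring identical words next to each other
# 3. count:    collapse duplicates, prefixing each word with its count
counts=$(printf '%s\n' "$sample" |
  tr -sc 'A-Za-z' '\012' |
  LC_ALL=C sort |
  uniq -c |
  awk '{print $1, $2}')   # normalize the spacing of uniq -c

printf '%s\n' "$counts"
# 2 and
# 1 bird
# 1 cat
# 1 dog
# 3 the
```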


    Glue

    Note how the example above relies on the shell's syntax for connecting programs. A program that expects input from the keyboard (stdin) can instead read from an existing text file (bible.txt) by using the < sign. The > sign sends output to a file instead of the default device (stdout, normally the screen). The | sign pipes the output of one program directly into the input of the next. In this way you can build veritable assembly lines of programs that progressively transform the original input into the output you need.

    read from input file <
    write to output file >
    pipe |
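The three kinds of glue can be tried out on a throwaway file. This is only a sketch: the file names in.txt and sorted.txt are hypothetical and are created in a temporary directory.

```shell
# All file names below are hypothetical and live in a throwaway directory.
tmp=$(mktemp -d)
printf 'b\na\nb\n' > "$tmp/in.txt"             # > writes stdout to a file

sort < "$tmp/in.txt" > "$tmp/sorted.txt"       # < reads stdin from a file
sorted=$(cat "$tmp/sorted.txt")

# | feeds the output of one program into the next;
# tr -d ' ' only removes padding that some versions of wc add.
distinct=$(sort < "$tmp/in.txt" | uniq | wc -l | tr -d ' ')

printf '%s\n' "$sorted"
echo "distinct lines: $distinct"
rm -rf "$tmp"
```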

    Step by Step

    1) more bible.txt

    ...
    1:1 In the beginning God created the heaven and
    1:2 And the earth was without form, and void; an
    1:3 And God said, Let there be light: and there
    1:4 And God saw the light, that [it was] good: a
    ...

    2) tr -sc 'A-Za-z' '\012' < bible.txt | more

    DOC
    Welcome
    To
    The
    World
    ...

    3) Filtering with a simple gawk program ...

    gunzip -c bible.txt.gz | tr -sc 'A-Za-z' '\012' | gawk 'BEGIN{flag=0};$0~/\<TEXT\>/{flag=1;next};$0~/\<\/TEXT\>/{flag=0;next};{if(flag>0){print}}' > bible.clean
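The gawk program above uses a flag variable to remember whether the current line is inside the wanted region. A minimal sketch of that technique on a tiny made-up input follows. It assumes the region is delimited by literal <TEXT> and </TEXT> lines and that the filter runs before tokenizing, so the markers are still intact. The real layout of bible.txt.gz may differ.

```shell
# Hypothetical input: only the lines between <TEXT> and </TEXT> are wanted.
body=$(printf '%s\n' 'header junk' '<TEXT>' 'In the beginning' 'God created' '</TEXT>' 'footer junk' |
  awk '
    /<TEXT>/   { flag = 1; next }   # opening marker: start copying, skip the marker
    /<\/TEXT>/ { flag = 0; next }   # closing marker: stop copying, skip the marker
    flag       { print }            # copy a line only while inside the region
  ')

printf '%s\n' "$body"
# In the beginning
# God created
```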

    4) Ordering and counting ...

    tr -sc 'A-Za-z' '\012' < bible.clean | sort | uniq -c | more

    7943 a
    234 A
    350 Aaron
    2 Aaronites
    ...


    More Counting Exercises


    sort lines of text

                man sort


    Sort Exercises


    Important Points Thus Far 


    Bigrams Algorithm

    tr -sc 'A-Za-z' '\012' < bible.clean > bible.words
    tail -n +2 bible.words > bible.nextwords

    paste bible.words bible.nextwords | more

    The Old
    Old Testament
    Testament of
    of the
    ...

    paste bible.words bible.nextwords | sort | uniq -c > bible.bigrams
    sort -nr < bible.bigrams | more

    11445 of the
    5964 the LORD
    4880 in the
    4044 and the
    2461 shall be
    ...
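The shift-and-paste trick is easier to see on a short sample. The sketch below uses an invented sentence and hypothetical file names in a temporary directory. LC_ALL=C is an added assumption for a reproducible sort order.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'the cat and the dog and the bird' |
  tr -sc 'A-Za-z' '\012' > "$tmp/words"
tail -n +2 "$tmp/words" > "$tmp/nextwords"      # same list, shifted up by one word

# Line i of the paste output holds word i and word i+1.
top=$(paste "$tmp/words" "$tmp/nextwords" |
  LC_ALL=C sort | uniq -c | sort -nr |
  awk 'NR == 1 {print $1, $2, $3}')              # keep the most frequent bigram only

echo "$top"
# 2 and the
rm -rf "$tmp"
```

Note that the last line of the paste output has an empty second column, because the shifted list is one word shorter.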



    Exercise 2: Count the trigrams in the Bible.


    grep & egrep: An Example of a Filter

    tr -sc 'A-Za-z' '\012' < bible.clean | grep 'ing$' | sort | uniq -c | more

    Example           Explanation
    grep gh           find lines containing "gh"
    grep '^con'       find lines beginning with "con"
    grep 'ing$'       find lines ending with "ing"
    grep -v gh        don't display lines containing "gh"
    grep -v '^con'    don't display lines beginning with "con"
    grep -v 'ing$'    don't display lines ending with "ing"


    More examples

    Example                                  Explanation
    grep '[A-Z]'                             lines with an uppercase char
    grep '^[A-Z]'                            lines starting with an uppercase char
    grep '[A-Z]$'                            lines ending with an uppercase char
    grep '^[A-Z]*$'                          lines with all uppercase chars
    grep '[aeiouAEIOU]'                      lines with a vowel
    grep '^[aeiouAEIOU]'                     lines starting with a vowel
    grep '[aeiouAEIOU]$'                     lines ending with a vowel
    grep -i '[aeiou]'                        lines with a vowel (ditto, ignoring case)
    grep -i '^[aeiou]'                       lines starting with a vowel (ditto)
    grep -i '[aeiou]$'                       lines ending with a vowel (ditto)
    grep -i '^[^aeiou]'                      lines starting with a non-vowel
    grep -i '[^aeiou]$'                      lines ending with a non-vowel
    grep -i '[aeiou].*[aeiou]'               lines with two or more vowels
    grep -i '^[^aeiou]*[aeiou][^aeiou]*$'    lines with exactly one vowel
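A few of these patterns can be checked against a short word list. The list below is made up for illustration. LC_ALL=C is an added assumption so that [A-Z] matches only uppercase ASCII letters regardless of the local language settings.

```shell
export LC_ALL=C
# A made-up word list, one word per line, as tr would produce it.
words=$(printf '%s\n' sing Con ring thing high STOP sky Aaron)

ing=$(printf '%s\n' "$words" | grep 'ing$')                        # sing ring thing
upper=$(printf '%s\n' "$words" | grep '^[A-Z]*$')                  # STOP
novowel=$(printf '%s\n' "$words" | grep -v -i '[aeiou]')           # sky
twovowels=$(printf '%s\n' "$words" | grep -i '[aeiou].*[aeiou]')   # Aaron

printf '%s\n' "$ing" "$upper" "$novowel" "$twovowels"
```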


    Regular Expressions 

    Example          Explanation

    a                match the letter "a"
    [a-z]            match any lowercase letter
    [A-Z]            match any uppercase letter
    [0-9]            match any digit
    [0123456789]     match any digit
    [aeiouAEIOU]     match any vowel
    [^aeiouAEIOU]    match any character except a vowel
    .                match any character
    ^                beginning of line
    $                end of line

    x*               any number of x (zero or more)
    x+               one or more of x (egrep only)
    x|y              x or y (egrep only)
    (x)              override precedence rules (egrep only)
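The egrep-only operators can be compared side by side on a small invented word list. The sketch uses grep -E, the POSIX spelling of egrep. The last two commands show why parentheses matter: without them, ^ binds only to the first alternative and $ only to the second.

```shell
export LC_ALL=C
words=$(printf '%s\n' g go goo cat dog catalog)   # made-up sample words

star=$(printf '%s\n' "$words" | grep '^go*$')            # g go goo     (zero or more o)
plus=$(printf '%s\n' "$words" | grep -E '^go+$')         # go goo       (one or more o)
loose=$(printf '%s\n' "$words" | grep -E '^cat|dog$')    # cat dog catalog
tight=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')  # cat dog

printf '%s\n' "$star" "$plus" "$loose" "$tight"
```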


    Grep Exercises



    sed (stream editor)

    
        

    sed exercises


    awk 



    Selecting Fields by Position

    awk '{print $1}'
    cut -f1
    awk '{print $2}'
    cut -f2
    awk '{print $NF}'
    rev | cut -f1 | rev
    awk '{print $(NF-1)}'
    rev | cut -f2 | rev
    awk '{print NF}'
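The awk and cut commands above are paired as equivalents. That holds for tab-separated input, because cut -f splits on tabs by default while awk splits on any run of blanks. A small sketch on one made-up record:

```shell
line=$(printf 'alpha\tbeta\tgamma')    # one tab-separated sample record

first_awk=$(printf '%s\n' "$line" | awk '{print $1}')
first_cut=$(printf '%s\n' "$line" | cut -f1)
last_awk=$(printf '%s\n' "$line" | awk '{print $NF}')
last_cut=$(printf '%s\n' "$line" | rev | cut -f1 | rev)   # reverse, take first, reverse back
nfields=$(printf '%s\n' "$line" | awk '{print NF}')

echo "$first_awk $first_cut $last_awk $last_cut $nfields"
# alpha alpha gamma gamma 3
```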



    Exercise 3: Sort the words in the Bible by their number of syllables (sequences of vowels). Which word has the most syllables?


    Filtering by Numerical Comparison

    awk '$1 > 100 {print $0}' bible.hist
    awk '$1 > 100 {print}' bible.hist
    awk '$1 > 100' bible.hist
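The three commands above are equivalent, because print with no arguments prints $0 and a pattern with no action prints the matching line. The sketch below checks this on a few lines in "count word" form. It assumes bible.hist has that layout, as produced by uniq -c, and borrows a few of the counts shown earlier as sample data.

```shell
# Assumed layout of bible.hist: "count word", one pair per line.
hist=$(printf '%s\n' '7973 a' '350 Aaron' '2 Aaronites' '1 Abaddon')

long=$(printf '%s\n' "$hist" | awk '$1 > 100 {print $0}')
medium=$(printf '%s\n' "$hist" | awk '$1 > 100 {print}')
short=$(printf '%s\n' "$hist" | awk '$1 > 100')

printf '%s\n' "$short"
# 7973 a
# 350 Aaron
```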


    Exercise 4: How many bigrams appear more than 10 times?


    Filtering by String Comparison

    sort -u bible.words > bible.types
    rev < bible.types | paste - bible.types | awk '$1 == $2'

    a a
    A A
    aha aha
    deed deed
    did did
    ...
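    1. == works on strings
    2. paste joins corresponding lines of its inputs, separated by a tab
    3. - tells paste to read that input from stdin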
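The palindrome filter can be traced on a handful of invented words. The file names are hypothetical and live in a temporary directory. LC_ALL=C is an added assumption for a reproducible sort order.

```shell
tmp=$(mktemp -d)
printf '%s\n' did dog deed a dog > "$tmp/words"
LC_ALL=C sort -u "$tmp/words" > "$tmp/types"     # a deed did dog

# Column 1 is each type reversed, column 2 is the type itself;
# a word is a palindrome when the two columns are equal.
pal=$(rev < "$tmp/types" | paste - "$tmp/types" | awk '$1 == $2 {print $2}')

printf '%s\n' "$pal"
# a
# deed
# did
rm -rf "$tmp"
```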

    Filtering by Regular Expression Matching

    awk '$2~/ed$/ {x = x + $1} END{print x}' bible.hist
    tr -sc 'A-Za-z' '\012' < bible.clean | grep 'ed$' | wc -l
    awk '$2~/ed$/ {x = x + 1} END{print x}' bible.hist

    tr -sc 'A-Za-z' '\012' < bible.clean | grep 'ed$' | sort | uniq -c | wc -l
    awk '/ed$/ {token = token + $1;
    type = type + 1}
    END {print token, type}' bible.hist

    awk '/ed$/ {token += $1; type++}
    END {print token, type}' bible.hist
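The difference between tokens (occurrences) and types (distinct words) shows up clearly on a tiny histogram. The counts below are invented, and the "count word" layout of bible.hist is an assumption.

```shell
# Made-up histogram in "count word" form.
hist=$(printf '%s\n' '3 walked' '1 talked' '5 the' '2 bed')

# token sums the counts of words ending in -ed; type counts how many such words exist.
result=$(printf '%s\n' "$hist" |
  awk '$2 ~ /ed$/ {token += $1; type++} END {print token, type}')

echo "$result"
# 6 3
```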




    Exercise 5: It is said that English avoids sequences of -ing words. Find bigrams where both words end in -ing. Do these count as counter-examples to the -ing -ing rule? For comparison's sake, find bigrams where both words end in -ed. Should there also be a prohibition against -ed -ed? Are there any examples of -ed -ed in the Bible? If so, how many? Which verse(s)?


    Arrays


    Mutual Info: An Example of Arrays

    paste bible.words bible.nextwords | sort | uniq -c > bible.bigrams
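    cat bible.hist bible.bigrams |
    awk -v N="$(wc -l < bible.words)" '
         NF == 2 { f[$2]=$1}
         NF == 3 { print log(N*$1/(f[$2]*f[$3]))/log(2), $2, $3}'

    where N is the total number of words, i.e. the line count of bible.words (wc -l < bible.words), passed to awk with -v.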
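The program computes, for each bigram, log2(N * count(x,y) / (count(x) * count(y))), using the array f to look up the single-word counts. A minimal end-to-end sketch on an invented sentence follows. File names are hypothetical, the histogram is assumed to be in "count word" form, and N is passed with awk's -v option, which is one possible way to supply it.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'new york is big and new york rocks' |
  tr -sc 'A-Za-z' '\012' > "$tmp/words"
tail -n +2 "$tmp/words" > "$tmp/nextwords"
LC_ALL=C sort "$tmp/words" | uniq -c > "$tmp/hist"                              # count word
paste "$tmp/words" "$tmp/nextwords" | LC_ALL=C sort | uniq -c > "$tmp/bigrams"  # count w1 w2

# The histogram comes first in the stream, so f[] is filled before any bigram is scored.
mi=$(cat "$tmp/hist" "$tmp/bigrams" |
  awk -v N="$(wc -l < "$tmp/words")" '
    NF == 2 { f[$2] = $1 }                                     # single-word counts
    NF == 3 { print log(N*$1/(f[$2]*f[$3]))/log(2), $2, $3 }   # score each bigram
  ' | grep ' new york$')

echo "$mi"
# 2 new york
rm -rf "$tmp"
```

Here N = 8, "new york" occurs twice, and "new" and "york" occur twice each, so the score is log2(8*2/(2*2)) = log2(4) = 2.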


    Exercise 6: Mutual information is unstable for small bigram counts. Modify the previous program so that it doesn't produce any output when the bigram count is less than 5.