grep, sort, and uniq

Assignment 3 Answers

  1. Find all names in the file dist.male.first that begin with 'JEFF'.
    $ grep '^JEFF' dist.male.first
    
  2. Find all names in the file dist.male.first that end in 'SON'.
    $ grep 'SON ' dist.male.first
    
    (You can't use $ after SON because the name is not the last thing on the line. I just used a space to ensure that it was at the end of the name.)
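    A minimal sketch of why the space works, run on made-up data. The file sample.txt and the numbers in it are hypothetical; only the layout (name, then space-padded columns) is assumed to match dist.male.first. The awk line is an alternative, not part of the assignment answer: it tests the first field directly, so it does not depend on a space following the name.

```shell
# Hypothetical sample in the same layout as dist.male.first.
tmp=$(mktemp -d)
printf '%s\n' \
  'JASON          0.660 40.000     30' \
  'SONNY          0.020 80.000    400' \
  'NELSON         0.090 70.000    200' > "$tmp/sample.txt"

# 'SON$' counts zero lines: each line ends with the rank, not the name.
# (grep exits non-zero when nothing matches, hence the "|| true".)
anchored=$(grep -c 'SON$' "$tmp/sample.txt" || true)

# 'SON ' matches SON only when a space follows, i.e. at the end of the name.
with_space=$(grep 'SON ' "$tmp/sample.txt" | cut -d' ' -f1)

# Alternative: let awk apply the $ anchor to the first field only.
with_awk=$(awk '$1 ~ /SON$/ { print $1 }' "$tmp/sample.txt")

echo "$with_space"
rm -r "$tmp"
```

    SONNY is not matched by either form, because its SON is followed by N rather than by the end of the name.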
  3. Find all names in the file dist.male.first that end in 'SON' and sort them alphabetically.
    $ grep 'SON ' dist.male.first | sort
    
  4. Find which names are in both the male and female files. (You will need to use cut or sed to get just the names from the files.)
    $ cut -d' ' -f1 dist.*male.first |\
    > sort | uniq -d 
    
    (dist.*male.first will match both dist.female.first and dist.male.first filenames.)
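    A minimal sketch of this approach on two tiny made-up files. The names and numbers are hypothetical; the sketch assumes, as the real files appear to, that a name occurs at most once in each file, so any name seen twice after combining must be in both.

```shell
# Hypothetical stand-ins for the two census files.
tmp=$(mktemp -d)
printf '%s\n' 'JAMES          3.318  3.318      1' \
              'LESLIE         0.035 60.000    300' > "$tmp/dist.male.first"
printf '%s\n' 'MARY           2.629  2.629      1' \
              'LESLIE         0.159 40.000    120' > "$tmp/dist.female.first"

# Keep only the names, sort so that repeats are adjacent,
# then let uniq -d print one copy of each repeated line.
both=$(cut -d' ' -f1 "$tmp"/dist.*male.first | sort | uniq -d)

echo "$both"
rm -r "$tmp"
```

    sort reads the output of cut from the pipe here. Giving sort a filename argument would make it read that file and ignore the pipe.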
  5. Combine the files of male and female first names and sort them by rank (column 4).
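    $ sort -k4n dist.*male.first
    
    (-k4n sorts numerically starting at the fourth field. The older form of the same key, +3n, is obsolete and may be rejected by current versions of sort.)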
  6. Combine the files of male and female first names and sort alphabetically and remove duplication.
    $ cut -c1-15 dist.*male.first | sort -u
    
    OR
    $ cut -c1-15 dist.*male.first | sort | uniq
    
    You can also use the -f option with space as a delimiter. This will only work for the first field, though, because there are multiple spaces separating fields. For some reason:
    cut -f1 -d' ' dist.male.first dist.female.first
    
    does not work but
    cat dist.male.first dist.female.first | cut -f1 -d' '
    
    does. I can't explain it. According to the man page, cut can accept multiple files, which it does with the -c option. It might be a bug in cut. If you figure it out let me know.
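    The multiple-space problem can be seen on a single line. The following is a sketch with a made-up line in the census layout; squeezing the spaces with tr -s, or using awk, are two common workarounds and are offered here as suggestions rather than as part of the assignment answer.

```shell
# A made-up line in the same layout as the census files.
line='JAMES          3.318  3.318      1'

# With -d' ', every single space is a delimiter,
# so "field 2" is the empty string between the first two spaces.
naive=$(printf '%s\n' "$line" | cut -d' ' -f2)

# Squeeze each run of spaces down to one space; now field 4 is the rank.
rank=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f4)

# awk splits on runs of whitespace by default, so no squeezing is needed.
rank_awk=$(printf '%s\n' "$line" | awk '{ print $4 }')

echo "$rank"
```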
  7. Display the female names that start with L.
    $ grep '^L' dist.female.first
    
  8. Starting with the file 'names' create a list of the first names in alphabetical order with the number of their rank by occurrence before each name. (Hint: nl will add line numbers to the output.) (3 marks)
    $ cut -f1 -d' ' names | sort | uniq -c |\
    > sort -nr | nl | sort -k3 
    
    We end up with a file with 3 columns: rank, frequency, and name. The file is sorted alphabetically by name. Here is what is happening:
    1. Get only the first names.
    2. Sort them alphabetically.
    3. uniq -c removes duplicates and includes the number of occurrences of each.
    4. Sort by the number of occurrences, numeric, highest to lowest.
    5. Number the lines. This gives us a ranking number.
    6. Sort it alphabetically by name.
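    The six steps can be traced on a small made-up input. The contents of this names file are hypothetical (first name, a space, then a surname), and the final awk stage is not part of the answer; it only trims the padding that nl and uniq -c add so the three columns are easy to read.

```shell
# Hypothetical 'names' file: CAL appears 3 times, BOB twice, ANN once.
tmp=$(mktemp -d)
printf '%s\n' 'CAL ORR' 'BOB RAY' 'CAL KIM' 'ANN LEE' 'BOB DAY' 'CAL FOX' > "$tmp/names"

# Steps 1-6 from the list above, then awk to tidy the spacing.
ranked=$(cut -f1 -d' ' "$tmp/names" | sort | uniq -c |
  sort -nr | nl | sort -k3 |
  awk '{ print $1, $2, $3 }')

echo "$ranked"
# Prints (rank, frequency, name):
# 3 1 ANN
# 2 2 BOB
# 1 3 CAL
rm -r "$tmp"
```

    The sample has no two names with the same frequency. With ties, the order among tied names after sort -nr decides which one gets the lower rank number.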