# Can Computers Think? Introduction to Computer Science

## CSC 106, Union College, Winter 2010

### Programming Project 4 - due Monday, March 8, at the end of the day

The counts that I mention in the comment for test_step6() in the Python file are wrong; they should be 469 and 512. Also, some people are getting a slightly different count for step 4, and I have not yet found out why. As long as the difference is small, just move on to the next step. It will hardly affect the results you get in the end on step 7.

#### Goal:

The goal of this project is to write a Python program that tells you whether a given text paragraph was written by one of two authors.

#### Step 0: Getting Started

• Clinton_training.txt contains text by Clinton that you should use to estimate how frequent different words and sequences of words are in Clinton's speeches.
• Clinton_testing.txt contains some additional passages of text by Clinton that you should use to test how well your program can decide whether a given passage was written by Clinton or Bush.
• Bush_training.txt contains text by Bush that you should use to estimate how frequent different words and sequences of words are in Bush's speeches.
• Bush_testing.txt contains some additional passages of text by Bush that you should use to test how well your program can decide whether a given passage was written by Clinton or Bush.
• author_classification_getting_started.py contains some code to get you started. It has stubs for all the functions you will be asked to write, as well as test functions for each step of the assignment.

#### Step 1: Building a unigram (word) frequency dictionary

Write a function called `make_unigram_dict` that takes a filename (string) as its only parameter. The function should read in the file and build a dictionary that maps words to their frequency. For example, if the word 'people' appears 12 times in the file, the dictionary should have the value 12 associated with the key 'people'.

Use the function `test_step1` to test your code.
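One possible sketch of this function is shown below. It assumes that words are separated by whitespace; the starter file may tokenize the text differently (e.g., handling punctuation or case), so treat this only as a starting point.

```python
def make_unigram_dict(filename):
    """Map each word in the file to how many times it occurs."""
    counts = {}
    with open(filename) as f:
        for line in f:
            # Assumption: split on whitespace; punctuation handling may differ.
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts
```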

#### Step 2: Building a bigram frequency dictionary

A bigram is a sequence of two words.

Write a function called `make_bigram_dict` that takes a filename (string) as its only parameter. The function should read in the file and build a dictionary that maps pairs of words to their frequency. For example, if the word 'people' followed by the word 'will' appears 12 times in the file, the dictionary should have the value 12 associated with the key 'people+++will'. (Note that the keys are strings: they consist of the two words joined together by '+++'.)

Use the function `test_step2` to test your code.
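A sketch of one way this could look follows. It assumes whitespace tokenization and that bigrams may cross line boundaries; check the starter file's conventions, which may differ.

```python
def make_bigram_dict(filename):
    """Map each pair of consecutive words to how many times it occurs."""
    counts = {}
    with open(filename) as f:
        # Assumption: read the whole file so bigrams can span line breaks.
        words = f.read().split()
    for first, second in zip(words, words[1:]):
        key = first + '+++' + second  # the two words joined by '+++'
        counts[key] = counts.get(key, 0) + 1
    return counts
```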

#### Step 3: Counting the total number of words (tokens)

Write a function called `count_words` that takes a word frequency dictionary (such as the one produced by the function `make_unigram_dict`) as its parameter and calculates the total number of word tokens - if the word 'people' occurs 12 times, it is counted 12 times. That is, the function should add up the frequency values of all dictionary entries.

Use the function `test_step3` to test your code.
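Since the dictionary's values are already the per-word frequencies, this step amounts to summing them, as in this sketch:

```python
def count_words(freq_dict):
    """Total number of word tokens: the sum of all frequency values."""
    return sum(freq_dict.values())
```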

#### Step 4: Counting the overall number of different (unique) words

Write a function called `count_unique_words` that takes a list of filenames as its parameter. It should return the size of the vocabulary used in all of the files. That is, it should count how many different words are used in the text files. Words that are used multiple times are only counted once.

Use the function `test_step4` to test your code.
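A set is a natural fit here, since adding a word that is already present has no effect. The sketch below again assumes whitespace tokenization, which may account for small count differences like the one mentioned at the top of this page.

```python
def count_unique_words(filenames):
    """Size of the combined vocabulary across all the given files."""
    vocab = set()
    for filename in filenames:
        with open(filename) as f:
            # Duplicate words are absorbed by the set automatically.
            vocab.update(f.read().split())
    return len(vocab)
```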

#### Step 5: Calculating the probability of a string based on each word's unigram probability

Write a function called `string_prob_unigrams` that takes a string of words as its first parameter. It should calculate the probability of this string based on the unigram probabilities of the words. Since these probabilities can get very small (so small that Python can no longer represent them accurately), you need to work with the logarithms of the probabilities rather than the probabilities themselves. Note that the function `unigram_prob`, which calculates the probability of one word, already returns the logarithm of the probability. So, to calculate the probability of a whole string of words, use `unigram_prob` to compute the (logarithm of the) unigram probability of each word, and sum those values.

That is, if your string consists of the words w1 w2 ... wn: string_prob = unigram_prob(w1) + unigram_prob(w2) + ... + unigram_prob(wn)

Use the function `test_step5` to test your code.
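The summation above can be sketched as follows. The real `unigram_prob` lives in the starter file; the version here is only a hypothetical stand-in with made-up probabilities, so that the sketch is self-contained.

```python
import math

# Hypothetical stand-in for the starter file's unigram_prob, which is
# assumed to already return the LOGARITHM of the probability.
def unigram_prob(word):
    probs = {'the': 0.05, 'people': 0.01, 'will': 0.02}
    return math.log(probs.get(word, 1e-6))  # tiny floor for unseen words

def string_prob_unigrams(text):
    # Sum the log-probabilities of each word in the string.
    return sum(unigram_prob(word) for word in text.split())
```

Summing logarithms corresponds to multiplying the underlying probabilities, which is why this avoids underflow.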

#### Step 6: Calculating the probability of a string based on bigram probabilities

Write a function called `string_prob_bigrams` that takes a string of words as its first parameter. It should calculate the probability of this string based on the bigram probabilities of the pairs of words composing the string. Again, you need to work with the logarithms of the probabilities rather than the probabilities themselves. Note that the function `bigram_prob`, which calculates the probability of one bigram, already returns the logarithm of the probability. So, to calculate the probability of a whole string of words, use `bigram_prob` to compute the (logarithm of the) bigram probability of each pair of consecutive words, and sum these probabilities up.

That is, if your string consists of the words w1 w2 ... wn: string_prob = unigram_prob(w1) + bigram_prob(w1,w2) + ... + bigram_prob(wn-1,wn)

Use the function `test_step6` to test your code.
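The formula above can be sketched as below. As in step 5, the real `unigram_prob` and `bigram_prob` come from the starter file; the constant-valued versions here are hypothetical stand-ins so the sketch runs on its own.

```python
import math

# Hypothetical stand-ins for the starter file's functions; both are
# assumed to already return LOGARITHMS of probabilities.
def unigram_prob(word):
    return math.log(0.01)

def bigram_prob(first, second):
    return math.log(0.001)

def string_prob_bigrams(text):
    words = text.split()
    if not words:
        return 0.0
    # The first word has no predecessor, so it contributes its unigram
    # log-probability; every consecutive pair adds a bigram log-probability.
    total = unigram_prob(words[0])
    for first, second in zip(words, words[1:]):
        total += bigram_prob(first, second)
    return total
```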

#### Step 7: Testing the implementation

Use the function `test_step7` to test how well this method of determining the authorship of a text passage works. How many of the passages written by Clinton are correctly classified? How many of the passages written by Bush are correctly classified?
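The decision that `test_step7` evaluates boils down to comparing a passage's score under each author's model. The sketch below is hypothetical (the function name and parameters are not from the starter file); it only illustrates the idea of picking the author whose model assigns the higher log-probability.

```python
def classify(passage, clinton_score, bush_score):
    """Hypothetical decision rule: clinton_score and bush_score are
    functions that return a passage's log-probability under each
    author's model; the higher score wins."""
    if clinton_score(passage) > bush_score(passage):
        return 'Clinton'
    return 'Bush'
```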