Can Computers Think?
Introduction to Computer Science

CSC 106
Union College
Winter 2010

Programming Project 4 - due Monday, March 8, at the end of the day

The counts that I mention in the comment for test_step6() in the Python file are wrong. They should be 469 and 512. Also, some people are getting a slightly different count for step 4, and I have not yet found out why. As long as the difference is small, just move on to the next step. It will hardly affect the results you get in step 7.

Goal:

The goal of this project is to write a Python program that tells you which of two authors wrote a given paragraph of text.

Step 0: Getting Started

Download the following files:

Step 1: Building a unigram (word) frequency dictionary

Write a function called make_unigram_dict that takes a filename (string) as its only parameter. The function should read in the file and build a dictionary that maps words to their frequency. For example, if the word 'people' appears 12 times in the file, the dictionary should have the value 12 associated with the key 'people'.

Use the function test_step1 to test your code.
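
As an illustration of the dictionary described in step 1, here is a minimal sketch. It assumes that a "word" is whatever Python's split() yields when a line is split on whitespace; the handout does not say how the files are tokenized, so punctuation and capitalization may need different handling to match the counts in test_step1. The sample text and the temporary file are made up for the demonstration.

```python
import os
import tempfile


def make_unigram_dict(filename):
    """Return a dictionary that maps each word in the file to its frequency."""
    counts = {}
    with open(filename) as infile:
        for line in infile:
            # Assumption: words are separated by whitespace only.
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts


# Small demonstration on a throwaway file (the text is invented).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as sample:
    sample.write('the people will vote\nthe people decide\n')
unigram_counts = make_unigram_dict(sample.name)
os.remove(sample.name)

print(unigram_counts['people'])  # 2
```

Using counts.get(word, 0) avoids a separate check for words that have not been seen yet.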

Step 2: Building a bigram frequency dictionary

A bigram is a sequence of two words.

Write a function called make_bigram_dict that takes a filename (string) as its only parameter. The function should read in the file and build a dictionary that maps pairs of words to their frequency. For example, if the word 'people' followed by the word 'will' appears 12 times in the file, the dictionary should have the value 12 associated with the key 'people+++will'. (Note that the keys are strings. They consist of the two words, joined together by '+++'.)

Use the function test_step2 to test your code.
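
The bigram dictionary from step 2 could be sketched as follows. Two things here are assumptions rather than requirements of the handout: words are split on whitespace, and the last word of one line forms a bigram with the first word of the next line. If test_step2 reports different counts, these are the first places to look.

```python
import os
import tempfile


def make_bigram_dict(filename):
    """Return a dictionary that maps 'word1+++word2' keys to their frequency."""
    with open(filename) as infile:
        # Assumption: the whole file is one sequence of words, so bigrams
        # may span line breaks.
        words = infile.read().split()
    counts = {}
    # zip pairs every word with its successor: (w1, w2), (w2, w3), ...
    for first, second in zip(words, words[1:]):
        key = first + '+++' + second
        counts[key] = counts.get(key, 0) + 1
    return counts


# Small demonstration on a throwaway file (the text is invented).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as sample:
    sample.write('the people will vote\nthe people decide\n')
bigram_counts = make_bigram_dict(sample.name)
os.remove(sample.name)

print(bigram_counts['the+++people'])  # 2
```

A file with n words yields n - 1 bigrams under these assumptions, which is a quick sanity check on the result.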

Step 3: Counting the total number of words (tokens)

Write a function called count_words that takes a word frequency dictionary (such as the one produced by the function make_unigram_dict) as its parameter and calculates the total number of word tokens. Every occurrence counts: if the word 'people' occurs 12 times, it is counted 12 times. That is, the function should add up the frequency values of all dictionary entries.

Use the function test_step3 to test your code.
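
A minimal sketch of the token count from step 3. The example dictionary is invented for the demonstration.

```python
def count_words(frequency_dict):
    """Return the total number of word tokens recorded in the dictionary."""
    # Adding up the values counts every occurrence of every word.
    return sum(frequency_dict.values())


example_counts = {'the': 2, 'people': 2, 'will': 1, 'vote': 1, 'decide': 1}
print(count_words(example_counts))  # 7
```

Note the difference from len(example_counts), which would give 5, the number of distinct words.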

Step 4: Counting the overall number of different (unique) words

Write a function called count_unique_words that takes a list of filenames as its parameter. It should return the size of the vocabulary used in all of the files. That is, it should count how many different words are used in the text files. Words that are used multiple times are only counted once.

Use the function test_step4 to test your code.
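
One way to count the vocabulary in step 4 is to collect the words of all files in a set, since a set keeps each word only once. The sketch below assumes whitespace tokenization, which may not match what test_step4 expects exactly. The two sample files are invented for the demonstration.

```python
import os
import tempfile


def count_unique_words(filenames):
    """Return the number of different words used across all given files."""
    vocabulary = set()
    for filename in filenames:
        with open(filename) as infile:
            # Assumption: words are separated by whitespace only.
            vocabulary.update(infile.read().split())
    return len(vocabulary)


# Small demonstration on two throwaway files (the texts are invented).
sample_names = []
for text in ('the people will vote\n', 'the people decide\n'):
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as sample:
        sample.write(text)
    sample_names.append(sample.name)

vocabulary_size = count_unique_words(sample_names)
for name in sample_names:
    os.remove(name)

print(vocabulary_size)  # 5
```

The words 'the' and 'people' occur in both files but are counted once, so the result is 5 and not 7.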

Step 5: Calculating the probability of a string based on each word's unigram probability

Write a function called string_prob_unigrams that takes a string of words as its first parameter. It should calculate the probability of this string based on the unigram probabilities of the words. Since these probabilities can get very small (so small that Python cannot accurately represent them anymore), you need to work with the logarithm of the probabilities rather than the probabilities themselves. Note that the function unigram_prob, which calculates the probability of one word, already returns the logarithm of the probability. Because the logarithm of a product is the sum of the logarithms, adding log probabilities corresponds to multiplying the probabilities. So, to calculate the probability of a whole string of words, you need to look at each word, use the function unigram_prob to calculate the (logarithm of the) unigram probability of that word, and sum the results.

That is, if your string consists of the words w1 w2 ... wn: string_prob = unigram_prob(w1) + unigram_prob(w2) + ... + unigram_prob(wn)
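
Why logarithms are needed can be seen in a small experiment. The numbers are made up: a passage of 200 words, each with probability 0.001. Multiplying the probabilities directly underflows to zero, while the sum of the logarithms stays an ordinary number.

```python
import math

word_probability = 0.001   # invented value for illustration
passage_length = 200       # invented value for illustration

product = 1.0
log_sum = 0.0
for _ in range(passage_length):
    product *= word_probability            # multiply raw probabilities
    log_sum += math.log(word_probability)  # add log probabilities instead

print(product)  # 0.0, because the true value 1e-600 is too small for a float
print(log_sum)  # a negative number of moderate size
```

Two passages whose raw probabilities both underflow to 0.0 can no longer be compared, but their log probabilities still can.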

Use the function test_step5 to test your code.
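
The summation in step 5 might look like the sketch below. The handout only says that the string is the first parameter, so the second parameter, word_log_prob, is a hypothetical placeholder for whatever is needed to call the provided unigram_prob. The function toy_unigram_prob and its counts are invented stand-ins so that the example runs on its own; they are not the provided unigram_prob.

```python
import math


def string_prob_unigrams(string, word_log_prob):
    """Return the log probability of the string under a unigram model.

    word_log_prob is a hypothetical parameter: a function that returns
    the logarithm of the probability of a single word.
    """
    total = 0.0
    for word in string.split():
        total += word_log_prob(word)
    return total


# Invented stand-in for the provided unigram_prob, for demonstration only.
toy_counts = {'the': 2, 'people': 1, 'will': 1}
toy_total = sum(toy_counts.values())


def toy_unigram_prob(word):
    return math.log(toy_counts[word] / float(toy_total))


# log(2/4) + log(1/4)
print(string_prob_unigrams('the people', toy_unigram_prob))
```

The result is always zero or negative, because every probability is at most 1 and its logarithm is therefore at most 0.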

Step 6: Calculating the probability of a string based on bigram probabilities

Write a function called string_prob_bigrams that takes a string of words as its first parameter. It should calculate the probability of this string based on the bigram probabilities of the pairs of words composing the string. Again, you need to work with the logarithm of the probabilities rather than the probabilities themselves. Note that the function bigram_prob, which calculates the probability of one bigram, already returns the logarithm of the probability. So, to calculate the probability of a whole string of words, you need to look at each pair of consecutive words, use the function bigram_prob to calculate the (logarithm of the) bigram probability of that pair, and sum these log probabilities. The first word has no predecessor, so its unigram probability is used instead.

That is, if your string consists of the words w1 w2 ... wn: string_prob = unigram_prob(w1) + bigram_prob(w1,w2) + ... + bigram_prob(wn-1,wn)

Use the function test_step6 to test your code.
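
A sketch of the bigram version from step 6, under the same caveat as before: only the first parameter is specified in the handout, so word_log_prob and pair_log_prob are hypothetical placeholders for the provided unigram_prob and bigram_prob. The toy tables of log probabilities are invented so that the example runs on its own.

```python
import math


def string_prob_bigrams(string, word_log_prob, pair_log_prob):
    """Return the log probability of the string under a bigram model.

    word_log_prob and pair_log_prob are hypothetical parameters: functions
    that return the log probability of one word and of one pair of words.
    """
    words = string.split()
    if not words:
        return 0.0
    # The first word has no predecessor, so its unigram probability is used.
    total = word_log_prob(words[0])
    # Every later word is scored together with the word before it.
    for previous, current in zip(words, words[1:]):
        total += pair_log_prob(previous, current)
    return total


# Invented stand-ins for the provided functions, for demonstration only.
toy_word_logs = {'people': math.log(0.25)}
toy_pair_logs = {
    ('people', 'will'): math.log(0.5),
    ('will', 'vote'): math.log(0.1),
}


def toy_unigram_prob(word):
    return toy_word_logs[word]


def toy_bigram_prob(first, second):
    return toy_pair_logs[(first, second)]


# log(0.25) + log(0.5) + log(0.1)
print(string_prob_bigrams('people will vote', toy_unigram_prob, toy_bigram_prob))
```

For a string of n words the loop runs n - 1 times, matching the n - 1 bigram terms in the formula.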

Step 7: Testing the implementation

Use the function test_step7 to test how well this method of determining the authorship of a text passage works. How many of the passages written by Clinton are correctly classified? How many of the passages written by Bush are correctly classified?
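
The handout does not spell out how test_step7 decides on an author. A plausible rule, shown here purely as an assumption, is to score the passage with each author's model and pick the author whose model gives the higher log probability. The function name classify_passage and the scores are invented for illustration.

```python
def classify_passage(log_prob_first, log_prob_second, first_author, second_author):
    """Return the author whose model assigns the passage the higher log probability.

    Hypothetical helper: ties go to the first author.
    """
    if log_prob_first >= log_prob_second:
        return first_author
    return second_author


# Invented scores: -120.5 is closer to zero than -131.2, so it is more probable.
print(classify_passage(-120.5, -131.2, 'Clinton', 'Bush'))  # Clinton
```

Log probabilities are negative, so "higher" means closer to zero.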

Submit on Blackboard