Programming Project 4 - due Monday, March 8, at the end of the day
The counts that I mention in the comment for test_step6() in the python file are wrong. They should be 469 and 512. Also, some people are getting a slightly different count for step 4, and I have not yet found why that is. As long as the difference is small, just move on to the next step. It will hardly affect the results you get in the end on step 7.
Goal:
The goal of this project is to write a python program that tells you whether a given text paragraph was written by one of two authors.
Step 0: Getting Started
Download the following files:
- Clinton_training.txt contains text by Clinton that you should use to estimate how frequent different words and sequences of words are in Clinton's speeches.
- Clinton_testing.txt contains some additional passages of text by Clinton that you should use to test how well your program can decide whether a given passage was written by Clinton or Bush.
- Bush_training.txt contains text by Bush that you should use to estimate how frequent different words and sequences of words are in Bush's speeches.
- Bush_testing.txt contains some additional passages of text by Bush that you should use to test how well your program can decide whether a given passage was written by Clinton or Bush.
- author_classification_getting_started.py contains some code to get you started. It has stubs for all the functions you will be asked to write. And it has test functions for each step of the assignment.
Step 1: Building a unigram (word) frequency dictionary
Write a function called make_unigram_dict
that takes a filename (string) as its only parameter. The function should read in the file and build a dictionary that maps words to their frequency. For example, if the word 'people' appears 12 times in the file, the dictionary should have the value 12 associated with the key 'people'.
Use the function test_step1
to test your code.
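A minimal sketch of what make_unigram_dict might look like. This assumes plain whitespace tokenization; the starter code may lower-case words or handle punctuation differently, which would change the counts slightly:

```python
def make_unigram_dict(filename):
    """Read the file and map each word to the number of times it occurs."""
    freq = {}
    with open(filename) as f:
        for line in f:
            for word in line.split():
                freq[word] = freq.get(word, 0) + 1
    return freq
```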
Step 2: Building a bigram frequency dictionary
A bigram is a sequence of two words.
Write a function called make_bigram_dict
that takes a filename (string) as its only parameter. The function should read in the file and build a dictionary that maps pairs of words to their frequency. For example, if the word 'people' followed by the word 'will' appears 12 times in the file, the dictionary should have the value 12 associated with the key 'people+++will'. (Note that the keys are strings. They consist of the two words, joined together by '+++'.)
Use the function test_step2
to test your code.
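One way make_bigram_dict could be sketched. This version lets bigrams span line breaks by reading the whole file at once; whether the starter code intends that is an assumption, and it uses the '+++' key format described above:

```python
def make_bigram_dict(filename):
    """Map each pair of consecutive words, joined by '+++', to its frequency."""
    freq = {}
    with open(filename) as f:
        words = f.read().split()
    for w1, w2 in zip(words, words[1:]):
        key = w1 + '+++' + w2
        freq[key] = freq.get(key, 0) + 1
    return freq
```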
Step 3: Counting the total number of words (tokens)
Write a function called count_words
that takes a word frequency dictionary (such as the one produced by the function make_unigram_dict
) as its parameter and calculates the total number of word tokens - if the word 'people' occurs 12 times, it is counted 12 times. That is, the function should add up the frequency values of all dictionary entries.
Use the function test_step3
to test your code.
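Since the token count is just the sum of all the frequency values, a sketch of count_words can be very short:

```python
def count_words(freq_dict):
    """Total number of word tokens: the sum of all frequency values."""
    return sum(freq_dict.values())
```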
Step 4: Counting the overall number of different (unique) words
Write a function called count_unique_words
that takes a list of filenames as its
parameter. It should return the size of the vocabulary used in all of the files. That is, it should count how many different words are used in the text files. Words that are used multiple times are only counted once.
Use the function test_step4
to test your code.
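A sketch of count_unique_words using a set to merge the vocabularies of all files. Note that tokenization details (e.g. how punctuation is handled) affect this count, which may be why people report slightly different numbers for this step:

```python
def count_unique_words(filenames):
    """Return the size of the combined vocabulary of all the files."""
    vocab = set()
    for filename in filenames:
        with open(filename) as f:
            vocab.update(f.read().split())
    return len(vocab)
```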
Step 5: Calculating the probability of a string based on each word's unigram probability
Write a function called string_prob_unigrams
that takes a string of words as its
first parameter. It should calculate the probability of this string based on the unigram probabilities of the words. Since these probabilities can get very small (so small that Python cannot accurately represent them anymore), you need to work with the logarithm of the probabilities rather than the normal probabilities. Note that the function unigram_prob
, which calculates the probability of one word, already returns the logarithm of the probability. So, to calculate the probability for a whole string of words, you need to look at each word and use the function unigram_prob to calculate the (logarithm of the) unigram probability of each word, and you need to sum them up.
That is, if your string consists of the words w1 w2 ... wn: string_prob = unigram_prob(w1) + unigram_prob(w2) + ... + unigram_prob(wn)
Use the function test_step5
to test your code.
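A sketch of string_prob_unigrams. The starter code already provides unigram_prob, so the stand-in below is purely hypothetical (its signature and add-one smoothing are assumptions); the loop-and-sum structure is the part that matters:

```python
import math

# Hypothetical stand-in for the starter code's unigram_prob; the real
# function's parameters and smoothing will differ.
def unigram_prob(word, unigram_dict, total):
    # add-one smoothing so unseen words still get a (small) log-probability
    return math.log((unigram_dict.get(word, 0) + 1) / (total + len(unigram_dict)))

def string_prob_unigrams(text, unigram_dict, total):
    """Sum the log unigram probabilities of all words in the string."""
    log_prob = 0.0
    for word in text.split():
        log_prob += unigram_prob(word, unigram_dict, total)
    return log_prob
```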
Step 6: Calculating the probability of a string based on bigram probabilities
Write a function called string_prob_bigrams
that takes a string of words as its
first parameter. It should calculate the probability of this string based on the bigram probabilities of the pairs of words
composing the string. Again, you need to work with the logarithm of the probabilities rather than the normal probabilities. Note that the function bigram_prob
, which calculates the probability of one bigram, already returns the logarithm of the probability. So, to calculate the probability for a whole string of words, you need to look at each pair of consecutive words and use the function bigram_prob to calculate the (logarithm of the) bigram probability for each such pair, and you need to sum these probabilities up.
That is, if your string consists of the words w1 w2 ... wn: string_prob = unigram_prob(w1) + bigram_prob(w1,w2) + ... + bigram_prob(wn-1,wn)
Use the function test_step6
to test your code.
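A sketch of string_prob_bigrams following the formula above: the first word contributes its unigram log-probability, and each consecutive pair contributes a bigram log-probability. The unigram_prob and bigram_prob stand-ins here are hypothetical (the starter code's versions will have different signatures and smoothing):

```python
import math

# Hypothetical stand-ins for the starter code's probability functions.
def unigram_prob(word, unigram_dict, total):
    return math.log((unigram_dict.get(word, 0) + 1) / (total + len(unigram_dict)))

def bigram_prob(w1, w2, bigram_dict, unigram_dict):
    count_pair = bigram_dict.get(w1 + '+++' + w2, 0)
    count_first = unigram_dict.get(w1, 0)
    # add-one smoothing so unseen pairs still get a (small) log-probability
    return math.log((count_pair + 1) / (count_first + len(unigram_dict)))

def string_prob_bigrams(text, bigram_dict, unigram_dict, total):
    """unigram_prob(w1) + bigram_prob(w1,w2) + ... + bigram_prob(wn-1,wn)."""
    words = text.split()
    if not words:
        return 0.0
    log_prob = unigram_prob(words[0], unigram_dict, total)
    for w1, w2 in zip(words, words[1:]):
        log_prob += bigram_prob(w1, w2, bigram_dict, unigram_dict)
    return log_prob
```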
Step 7: Testing the implementation
Use the function test_step7
to test how well this method of determining the authorship of a text passage works. How many of the passages written by Clinton are correctly classified? How many of the passages written by Bush are correctly classified?
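The decision rule that test_step7 presumably applies can be sketched as follows: score the passage under each author's model and pick the author with the higher log-probability (the function name guess_author is illustrative, not from the starter code):

```python
def guess_author(logprob_clinton, logprob_bush):
    """Pick the author whose model assigns the passage the higher log-probability."""
    # A higher (less negative) log-probability means the passage looks
    # more like that author's training text.
    return 'Clinton' if logprob_clinton > logprob_bush else 'Bush'
```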