Take Home Exam 3 - due Friday, 11/6 before class
Note: This is a take home exam. That means that, unlike the homework assignments, you have to work on it on your own. Do not talk to class members or anyone else about the questions or your solutions. Do not use any resources other than a) the textbook (Gaddis, Starting out with Python), b) the online textbook Think Python, c) the Python documentation, and d) the pygame documentation.
For all programming problems submit both an algorithm description and the implementation.
Here are two text files to work with. These are big files, so they may be unwieldy for testing; you may want to create your own, smaller text file for that purpose.
- George W. Bush's State of the Union addresses (2001-2006)
- Bill Clinton's State of the Union addresses (1993-2000)
1) Creating a word list from a file
Write a function that reads in a text file and creates a list of all the different words in the file. That is, the list should not have any duplicates; even if a word appears multiple times, it should only be in the list once. The function should take one parameter: the filename as a string. It should have a return value: the list of words.
Hint: So far we have mostly used the split function in the following way:
stringlist = string.split(":")
That is, we have used the colon (":") as a marker where to split. You can use other markers by providing the right parameter value to the split function. For example, if you want to split at every "+" sign, the function call would be string.split("+").
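As a small illustration of the hint above, the following sketch splits a few made-up strings at different markers. The strings and variable names are invented for this example and are not taken from the exam files.

```python
# Illustrative strings only; they are not taken from the exam files.
record = "apple:banana:cherry"
fruits = record.split(":")         # split at every colon

expression = "1+2+3"
terms = expression.split("+")      # split at every "+" sign

sentence = "the quick  brown fox"  # note the two spaces after "quick"
at_spaces = sentence.split(" ")    # split at every single space
at_whitespace = sentence.split()   # no argument: split at any run of whitespace

print(fruits)
print(terms)
print(at_spaces)
print(at_whitespace)
```

Splitting at a single space leaves an empty string in the result wherever two spaces follow each other, while calling split without an argument does not.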
Another hint: You can use the in keyword to test whether a value is contained in a list. For example, 4 in list evaluates to True if list contains 4 and to False if list does not contain 4.
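The membership test can be tried out on a short list of numbers. This is only a sketch with made-up values; the names numbers and seen are chosen for the example.

```python
numbers = [1, 4, 9]
has_four = 4 in numbers   # True, because 4 is an element of numbers
has_five = 5 in numbers   # False, because 5 is not an element of numbers

# "not in" is the opposite test. Here it is used to collect each value
# only the first time it shows up.
seen = []
for value in [3, 1, 3, 2, 1]:
    if value not in seen:
        seen.append(value)

print(has_four, has_five, seen)
```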
2) Creating a word frequency dictionary from a file
Write a function that reads in a text file and creates a dictionary that maps all the different words in the file to their frequency. The function should take one parameter: the filename as a string. It should have a return value: the dictionary.
Hint: You can use the dictionary method has_key to test whether an entry with a given key exists in a dictionary. For example, d.has_key("hat") evaluates to True if the dictionary d has an entry with key "hat". (has_key exists only in Python 2. In Python 3, write "hat" in d instead.)
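The key test from the hint can be sketched on a tiny dictionary. The example below uses the in keyword rather than has_key so that it runs under both Python 2 and Python 3; under Python 2, d.has_key("hat") gives the same answer. The dictionary contents and the color list are made up for illustration.

```python
d = {"hat": 3, "coat": 1}
hat_present = "hat" in d      # True: d has an entry with key "hat"
scarf_present = "scarf" in d  # False: d has no entry with key "scarf"

# A common pattern built on this test: create the entry the first time
# a key is seen, and update the existing entry after that.
counts = {}
for color in ["red", "blue", "red"]:
    if color in counts:
        counts[color] = counts[color] + 1
    else:
        counts[color] = 1

print(hat_present, scarf_present, counts)
```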
3) Finding the 100 most frequent words
Write a function that takes a dictionary of word-frequency pairs (such as the one created in the previous exercise) and returns a list of the 100 most frequent words.
Hint: There are different strategies that you can use. Here is one: Take the dictionary and turn it into a list whose elements are tuples of 2 elements, with the frequency as the first element of the tuple and the word as the second. Then use the list method sort to sort the list. This should give you a list with the low-frequency words at the beginning and the high-frequency words at the end. You can then slice off the last 100 elements to get the 100 most frequent words.
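The strategy in the hint can be sketched on a four-entry dictionary, keeping the top 2 instead of the top 100. The function name most_frequent and its parameter n are hypothetical names chosen for this example; this is one possible reading of the hint, not the only way to do it.

```python
def most_frequent(freq, n):
    """Return the n most frequent words in freq, least frequent first.

    Assumes n is at least 1 (pairs[-0:] would return the whole list).
    """
    # Turn the dictionary into a list of (frequency, word) tuples.
    pairs = []
    for word in freq:
        pairs.append((freq[word], word))

    # Tuples are compared by their first element first, so this sorts by
    # frequency; words with equal frequency end up in alphabetical order.
    pairs.sort()

    # The high-frequency words are at the end of the sorted list.
    top = pairs[-n:]

    # Keep only the word from each tuple.
    words = []
    for pair in top:
        words.append(pair[1])
    return words


sample = {"a": 5, "b": 2, "c": 9, "d": 1}
print(most_frequent(sample, 2))
```

If the dictionary has fewer than n entries, the slice simply returns all of them.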
Bonus: comparing authors
Now, write a program that takes two text files (such as the ones given above) and compares them based on the most frequently used words in both of them.
That is, for each of the two files, create a list of the 100 most frequent words (using the functions developed in the first part of this take home exam). Then build two new lists: the list of all words that are among the 100 most frequent in file 1 but not among the 100 most frequent for file 2; and the list of all words that are among the 100 most frequent in file 2 but not among the 100 most frequent of file 1.
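Building a list of the elements that are in one list but not in another can be sketched with the in keyword from the earlier hint. The function name only_in_first and the sample lists are invented for this illustration.

```python
def only_in_first(first, second):
    """Return the elements of first that do not appear in second."""
    result = []
    for item in first:
        if item not in second:
            result.append(item)
    return result


list1 = ["the", "and", "economy", "people"]
list2 = ["the", "and", "security", "people"]

print(only_in_first(list1, list2))  # in list1 but not in list2
print(only_in_first(list2, list1))  # in list2 but not in list1
```

Calling the function twice with the arguments swapped gives the two lists that the comparison asks for.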
Examine the output. Where do you think the differences come from?
Submit
Submit on Blackboard. If it does not let you upload your files, send them to me by email and use the following subject line: "CSC 105 take home exam 3".