Homework Week 8 - due Wednesday, June 3rd, before class
This homework is due on Wednesday, June 3rd, which is the Wednesday of week 10.
Overview: What you will do in this homework
You will process two text files and then compare them based on the most frequent words in each file. More specifically, you will find out which words are among the 100 most frequent words in text 1 but not among the 100 most frequent words in text 2, and vice versa.
So, you will do the following steps:
- Read in file 1 and create a dictionary that stores how often each word occurred. (This is similar to what we have done in class, and you can reuse some of that code.)
- Do the same for file 2.
- Take dictionary 1 and create a list of the 100 most frequent words.
- Do the same for dictionary 2.
- Compute the list of all words that are among the 100 most frequent words of file 1 but not among the 100 most frequent words of file 2.
- And vice versa: compute the list of all words that are among the 100 most frequent words of file 2 but not among the 100 most frequent words of file 1.
- Look at the output. Where do you think the differences are coming from?
Reading in the files (Steps 1 and 2)
As I said before, this is the same thing as what we did in class. You can (re-)use the code from that in-class exercise; a sketch of one possible approach is given after the file list below.
Here are the two text files (retrieved from C-SPAN State of the Union videos and transcripts):
- George W. Bush's State of the Union addresses (2001-2006)
- Bill Clinton's State of the Union addresses (1993-2000)
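In case you want a starting point, here is a minimal sketch of this step. It assumes the two transcripts have been saved locally as bush.txt and clinton.txt (placeholder names; use whatever filenames you chose), and it uses the same simple lowercase-and-split scheme we used in class:

```python
def count_words(filename):
    """Read a text file and return a dictionary mapping each
    word to the number of times it occurs in the file."""
    counts = {}
    with open(filename) as f:
        for line in f:
            # Lowercase everything and split on whitespace -- the
            # same simple tokenization scheme we used in class.
            for word in line.lower().split():
                counts[word] = counts.get(word, 0) + 1
    return counts

# Placeholder filenames -- substitute whatever names you saved the files under.
bush_counts = count_words("bush.txt")
clinton_counts = count_words("clinton.txt")
```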
Create a list of the 100 most frequent words (Steps 3 and 4)
For this, you need to create sorted lists (sorted by the count, not alphabetically by the words) from the dictionaries produced in the previous step. We also did that in class the other day, and you can again reuse the code.
After you have the sorted lists, you can simply take the last 100 elements (assuming the lists are sorted in ascending order of count, so the most frequent words come last).
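One possible shape for this step, building on the count_words sketch above (top_100 is just a name I made up for illustration):

```python
def top_100(counts):
    """Return the 100 most frequent words as (count, word) tuples,
    in ascending order of count (most frequent word last)."""
    pairs = [(count, word) for (word, count) in counts.items()]
    pairs.sort()          # sorts by count first (ties broken by word)
    return pairs[-100:]   # the last 100 elements are the most frequent

h1 = top_100(bush_counts)
h2 = top_100(clinton_counts)
```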
Compute the list of words that are among the 100 most frequent words in one file but not in the other (Steps 5 and 6)
The last step produces two lists - one for each input file. Each of these lists has 100 elements representing the 100 most frequent words in the file. Each element is either a single word (string) or a tuple of a number (frequency) and a string (word), depending on how you implemented the last step. Let's call these two lists h1 and h2.
Now you want to create a new list that contains all words mentioned in h1 that are not mentioned in h2.
And then, do the same thing the other way round: create a second new list which contains all words mentioned in h2 that are not in h1.
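Here is one way this could look, assuming h1 and h2 hold (count, word) tuples as in the top_100 sketch above; if your lists hold plain strings instead, you can skip the unpacking step:

```python
# Pull out just the words, dropping the counts.
words1 = [word for (count, word) in h1]
words2 = [word for (count, word) in h2]

# Words frequent in file 1 but not in file 2, and vice versa.
only_in_1 = [word for word in words1 if word not in words2]
only_in_2 = [word for word in words2 if word not in words1]

print("Among the top 100 of file 1 but not file 2:", only_in_1)
print("Among the top 100 of file 2 but not file 1:", only_in_2)
```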
Examine the output
Where do you think the differences are coming from?