CSc 150

Project 7 -- The Grammar Checker
Due: Monday, November 17, 2008

Objectives

Introduction

Most word processors today have an automated grammar checker that checks for common grammar mistakes such as noun-verb agreement or comma placement. In this final project, you'll write a small version of this application that checks for overused words in a text file. When found, it replaces the overused words with synonyms from a thesaurus. In the process, you'll have to contend with all of the good design principles we've seen in this course, since most of the design will be in your hands.

Your Mission

Your assignment is to write a program that will do three things:
  1. Create a thesaurus from a text file of keywords and synonyms.
  2. Create a word counter that determines the frequency of each word in an input file (i.e. the number of times each word appears).
  3. Use the previous two data structures to look at each word in the input file, determine if it is overused (appears too many times), and, if so, replace the word with a random synonym (if the entry exists in the thesaurus).
The output to this program will be another text file containing the original input text, except that all of the overused words will be replaced by synonyms where possible. Nothing needs to be printed to System.out for this project.

Requirements

Most of the design of this project is up to you, but there are some required elements that everyone should have.

You should use a binary search tree to store the thesaurus information. Each node in the tree should contain an entry in the thesaurus and an associated synonym list. The synonym list is really just an unordered collection of words, so reuse your Bag class to store the synonyms. You should be able to insert new entries/synonyms into the thesaurus, delete entries, and print the thesaurus in alphabetical order by entry (especially useful for debugging). You are also required to have a getSynonymFor method which returns a random synonym for a given word.

The frequency information for each word should also be stored in a binary search tree. Like the thesaurus, the tree will be organized by keyword. For those who like to see it in pictures, here's a sampling of how binary search trees will be used to store both the thesaurus info and the word frequency info:

And, as always, you are also not permitted to use any of Java's built-in containers in your program, such as ArrayList, Vector, Stack, LinkedList, etc. Use your own.

Finally, all documentation should be in Javadoc format for this project.

The Algorithm

Once the thesaurus and word counter are built, you're ready to find overused words in the input file and replace them with synonyms. Here's an overview of the algorithm:
for each word in the input file
{
   check word counter for frequency of word
   if frequency above some threshold (meaning it's overused)
   {
      check thesaurus for an entry
      if entry not there
         write out original word
      else  // entry is there
      {
         get a random synonym for that entry
         write out the random synonym
      } 
   }
   else  // word not overused
      write out original word
}

The threshold for word overuse should be easily changeable in your program. 2 is a good default (so a word is overused if it appears more than twice in the file). You should flesh out this algorithm using top-down design.

Why a BST?

An important part of program design is deciding which data structures to use. As we study more and more ADTs, it is important for you to know the strengths and weaknesses of each one so you can pick the right structure for the job.

For this project, both the word counter and the thesaurus will be searched a lot. The word counter gets searched for every word in the input file (to see if it's overused). The thesaurus gets searched for each overused word found. So an ADT with a good running time for searches is preferable. A binary search tree is a good choice since it has an average case running time of O(log n) for the search operation compared to O(n) average case search time for an array or linked list.

Files to Download

I'm providing you with some files on BlackBoard to help you get started. They are: Check out the Javadocs for the starter code for more info on all of the classes I'm providing for you.

Where do I start?

The rest of the code is up to you. I won't be telling you every single class or method you need to write. Figuring out the design is part of your job. Here are some hints to get you started though:

Extra Credit alert: One can extend our algorithm in many ways. Here are just a couple of ideas that would net extra credit.

This project is big. START IT NOW. I'm giving you 2 weeks for a reason. I will be happy to critique and give feedback on designs you come up with, but I won't give you a design outright. That's your job. Think out the design on paper before you write a single line of code. And come see me even if you just want to make sure that you're on the right track.

Grading

This project is worth double a normal project due to its complexity. The 50 points will be divided up this way:

As always, practice good programming skills:

Remember to turn in both a paper and an electronic copy of your project. Be sure to turn in your own test input files.

Having trouble? Don't wait until the last minute! Come see me and get your $80 worth.

Administrative statement

Programming assignments, like homework assignments, are individual projects. I encourage you to talk to others about the general nature of the project and ideas about how to pursue it. However, the technical work, the writing, and the inspiration behind these must be substantially your own. If any person besides you contributes in any way to the project, you must credit their work on your homework. Similarly, if you include information that you have gleaned from other published sources, you must cite them as references. Looking at, and/or copying, other people's programs or written work is inappropriate, and will be considered cheating.
Return to Project Index