CSc 150

Project 1 -- Looking Out for Number One
Due: Wednesday, April 7, 2010

Objectives

Introduction

Say you were given a list of statistics. The source could be almost anything: batting averages, city populations, number of sunspots observed each day for a year-- whatever. If you were to look at each number and figure out how many of them started with a 1, how many started with a 2, etc., you'd expect an equal distribution, since the digits that make up those numbers can vary pretty randomly.

But you'd be wrong. There is a strange, but true, law known as Benford's Law that says that in any given list of statistics, the number 1 will appear most often as a leading digit (about 30% of the time). 2 will appear next most often, followed by 3, etc. The graph below shows the law at work for a variety of data sets.

Graph courtesy of Steve Wolfman

Most people find Benford's Law surprising and counterintuitive (but neat!) If you'd like to know more about why Benford's Law works and its history, read this article by Jon Walthoe which does a pretty good job of explaining it. Furthermore, a computer program is great at showing mathematical phenomena like this one in action. That's exactly what you'll do in this assignment: number crunching statistical files to prove to yourself that Benford's Law really is true.

Your Mission

Create a program that takes a file of statistics as input, computes how many of the numbers have x as a leading digit for all x 1-9, and then plots a bar graph showing the relative frequency of each of the digits when they lead off a number. The output of your program should look something like this:
File: "sunspot_data.txt"

frequency of the lead digit
---------------------------
1: *****************************************************************
2: **********************************************
3: ****************************************
4: ********************************
5: *************************
6: **********************
7: ********************
8: ********************
9: ******************
Don't forget to include the filename and a title for your bar graph. Well-labeled output is part of good design!

The Details

To help you with the file I/O, I'm providing a Tokenizer class on Blackboard that you should use to help obtain input. It has only one instance method, nextToken, that returns the next unread token (i.e. a number) as a string. Consult these Javadocs to learn how to use it, and test it out first! For example, you could write a small program that simply reads every number from the input file and then prints it.

You'll also want to consult the Java API docs to learn about methods that can help you parse the string you get from the Tokenizer. You'll need to strip out the first character from the string and turn it into a number (i.e. something of type int). Both the String and Character classes will be useful to look at. Get used to looking up stuff in the docs like this -- that's how you save yourself time in the long run by not having to reinvent code that already does what you want.

In addition, here are some sample input files to test your code on.

Your tests should not stop here though. You are required to find two other input files that demonstrate Benford's law and include them with your project. Remember, the data cannot be too random or too constrained or else Benford's Law won't apply. (Read the article above for why.) But that still leaves lots of choices. And make sure the two files you pick are different types of data -- not two files of population data from two different states, for example. Get creative to make a more convincing argument of how widespread Benford's law really is.

PITFALL ALERT: The input files need to be in a particular format (see the Tokenizer docs). So look for spreadsheet files where the data you want is in a single column that you can easily cut and paste into a text file. That'll make your life much easier.

Document in Javadoc style

You are required to comment your code in Javadoc style. (The Tokenizer class is commented in this way.) I'll show examples of how to do this in class, but you should also consult Sun's reference page which shows good examples. At minimum, every parameter and return value of every method should be commented using the @param and @return tags. Every class should have a descriptive comment too. If you're using Eclipse, you can automatically generate the Javadoc web pages by going to the Project menu - Generate Javadoc.

Some Thoughts on Design

Remember that you're being graded not just on if the program works, but how well you designed your code. Try to think in an object-oriented way. Since this first program is mainly to get you used to Java, I'll help you out this first time. Try to follow what we did in class. There is always a main method that the program starts with, and it's in a class by itself. That method creates and sends messages to other objects to get everything else done. What kinds of objects would be involved here? One way to go is to have a BenfordTester object whose job (behavior) is to process the input file and do the graphing. This BenfordTester class could then have instance methods to define this behavior, like those below:
/** Counts the number of times the number n is the first digit
 *  in file filename for all n=1...9.  
 *  
 *  You'll be using the Tokenizer class in this method to get
 *  each number from filename and then figuring out
 *  what its leading digit is.  How you use this information
 *  to figure out the total count for all 9 digits is up to you.
 */
public void counter(String filename)

 
/** Uses the nine total counts (one for each digit 1-9) to form
 *  a histogram (a bar graph) of leading digit frequency.
 *
 *  One asterisk will be printed in the bar graph for every
 *  unitsPerAsterisk in the count.  For example, say you had
 *  counted that '2' was the leading digit 100 times.  If
 *  unitsPerAsterisk was 25, then you'd print 4 asterisks for the
 *  '2' bar (1 asterisk for each 25 in the count).  If
 *  unitsPerAsterisk was 1, then you'd print 100 asterisks for
 *  the '2' bar (1 asterisk for each 1 in the count).  Being able to
 *  adjust this for different statistics files will let you see
 *  coarser or finer levels of detail so you can make an easy-to-read
 *  bar graph.
 */
public void grapher(int unitsPerAsterisk)	

In this way, you separate the counting task from the graphing task. You are not required to use the methods above. They're just suggestions. You may also use them but alter the parameters and return types, if you wish.

Remember to practice good programming skills:

Grading

You should check out the grading standards on our course web page that will be used for all programming projects. This project will be worth 50 points. It will be divided up this way:

What to turn in

Be sure to include your own input files in your project when you turn it in. If you're using Eclipse, you can just include it in the project folder before you zip it up. Please turn in both a paper and an electronic copy of your project. The zipped up project will be turned in to the Dropbox on Blackboard.

Having trouble? Don't wait until the last minute! Come see me and get your $80 worth.

Gentle Reminder

Programming assignments, like homework assignments, are individual projects. I encourage you to talk to others about the general nature of the project and ideas about how to pursue it. However, the technical work, the writing, and the inspiration behind these must be substantially your own. If any person besides you contributes in any way to the project, you must credit their work on your project. Similarly, if you include information that you have gleaned from other published sources, you must cite them as references. Looking at, and/or copying, other people's programs or written work is inappropriate, and will be considered cheating.
Return to Project Index