CSC-103: Taming Big Data

Course Overview

Taming Big Data uses big data and computational science as the theme for learning about computers and computing. It will focus on applications from the sciences and social sciences.

We'll start the course with an overview of how we communicate with a computer and what programming is all about. Then we will move on to look at how we get the computer to manipulate data. By the end of the course, students will be able to develop programs that manipulate data for problems such as simulation, classification, and financial analysis. They will also be able to prepare data so that it can be processed by existing applications and tools.

The course starts out with a series of relatively small assignments. In the later portion of the term the problems grow in size, in keeping with students' increased ability to handle large amounts of data that are stored in files.

Language & Resources

Most of the programming in this course is done in Python. Increasingly, students who are especially interested in the course topic are also encouraged to explore R.

In class, you are required to use our lab iMacs. However, when working on your projects outside of class, you have a choice. If you'd like to continue using our iMacs, feel free! We have three spaces that you can use:

All of these labs are available to you 24/7 using your ID card, except when classes are being held in them.

Course Text

Example Assignment

CSC-103
Fall 2012
Final Programming Assignment

Here are two data files. Each line of these files contains data for a single day. The line looks like
19600101,28.4
where the first field on the line is the date (year, month, day) in YYYYMMDD format, followed by a comma and the daily high temperature. So the example line is for January 1, 1960, and the high temperature was 28.4 degrees (all temperatures are Fahrenheit).
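As a minimal sketch of how one such line could be taken apart in Python: the function name parse_line is made up for illustration and is not part of the assignment, and the code assumes every line has exactly the two comma-separated fields described above.

```python
from datetime import date


def parse_line(line):
    """Split one 'YYYYMMDD,temperature' line into a date and a float.

    parse_line is a hypothetical helper name, not something the
    assignment provides.
    """
    stamp, temp = line.strip().split(",")
    # Slice the eight-digit stamp into year, month, and day.
    day = date(int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]))
    return day, float(temp)


day, high = parse_line("19600101,28.4\n")
print(day, high)  # 1960-01-01 28.4
```

Converting the first field to a date object rather than keeping it as a string makes it easier to group readings by month or year later on.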

Your assignment:

NOTE: If you really, really, really want data for a different site, let me know. The program that generates each data file takes several hours to run, because it has to read in a web page for each day of the 12 years. I might consider rerunning it for other locations. Alternatively, I'm happy to give you the .py file for my program, along with the necessary library. You can figure out the Weather Underground code for the location you are interested in and edit the program accordingly. This will earn you a small amount of extra credit.

Example Lab

File Reading Exercise

The file google-closing-price.csv contains just the daily closing price of Google stock from the day the stock went public through the end of August, 2012. Save this file into your directory.

Your task is to compute the average closing price over the entire period that the stock has been traded.

DO NOT USE readlines() or read(). DO USE readline().

Remember that the easy way to handle file access is to work in your own directory, and run python from the OS prompt in a Terminal window.

How to do this? If you want to plunge on ahead with no helpful hints, go for it! If you want some hints and tips, click here.
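For reference, one possible shape for a readline()-based solution is sketched below. It assumes the file holds one price per line with no header row, which the description suggests but does not guarantee, and the function name average_closing_price is made up for illustration.

```python
def average_closing_price(filename):
    """Average the numbers in a file that holds one price per line.

    Assumes no header row; average_closing_price is a hypothetical name.
    """
    total = 0.0
    count = 0
    with open(filename) as infile:
        line = infile.readline()
        while line != "":          # readline() returns "" at end of file
            line = line.strip()
            if line:               # skip any blank lines
                total += float(line)
                count += 1
            line = infile.readline()
    if count == 0:
        return None                # nothing to average
    return total / count
```

Calling average_closing_price("google-closing-price.csv") from the directory where the file was saved would return the average. Reading one line at a time keeps only a single line in memory, which is why this pattern scales to files far larger than this one.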

Assignments & Grades

There will be regular programming assignments. As the term progresses these will use an increasing number of features of the Python language. We may also use other packages, such as the R statistical analysis package, later in the term.

There will be two in-class exams and a final exam. There may also be pop quizzes on material covered in prior classes and the reading. The intent is not that these be "punitive" in any way but, rather, that they provide motivation for you to keep up. Learning to program is like learning a foreign language. If you don't speak it during some part of every day your progress will be quite slow.

Grading:

The allocation of emphasis among the course components is as follows:

Late Work:

No homework will be accepted late unless a prior arrangement is made. Just in case you missed that the first time: no homework will be accepted late unless a prior arrangement is made.

All hardcopy of homework is due at the beginning of class on the due date. Electronic submission of program executables must be done before you arrive at class.

Schedule

Class Schedule

(subject to change)

Week 1 & 2
Topic: How do we communicate with a computer? How do we make the computer do what we want? What is Computer Science? What is computational science? What is programming?
Programming Concepts: Introduction to algorithms, programs, functions, variables, arithmetic. Working with Python.
Readings: PP: Ch 1, Ch 2

Week 3
Topic: We have all this data! How do we manipulate it, and make decisions based on it?
Programming Concepts: Lists, Introduction to control flow (repetition). Modules, Introduction to Objects & Methods.
Readings: PP: Ch 4, Ch 5, Ch 7

Week 4
Topic: What about text data?
Programming Concepts: Strings
Readings: PP: Ch 3

Week 5
Exam on 10/4
Topic: Can we do things more than once?
Programming Concepts: More control flow, making choices, more repetition
Readings: PP: Ch 6 & 7

Week 6
Topic: Need to find something in that data? Is your data stored in a file?
Programming Concepts: Search, Nested Lists, File Processing
Readings: PP: Ch 8

Week 7
Topic: Sometimes data comes in interesting groups or relationships
Programming Concepts: Sets and Dictionaries
Readings: PP: Ch 9

Week 8
Topic: Sometimes things are easier if information is in order. Sometimes data items are connected to each other. And sometimes programs blow up!
Programming Concepts: Finish dictionaries and sets. Exceptions. Search and sort.
Exam on 10/30

Week 9
Topic: Computation in various disciplines
Readings: PP: Chapter 14

Week 10
Programming Concepts: Regular Expressions, External programs
Readings: Wilson: 2.1-2.3, 3.1-3.4 (will be provided)