Homework 2: Analyzing Structured Data using CSV and JSON

Due Date: See class schedule

In this assignment you’ll learn how to process and analyze data structured in common formats using Python. You’ll gain experience in using Python data structures, and also useful Python modules for processing CSV and JSON data.

This homework assignment is less guided than the previous one. I want you to pick a data source that you’re interested in, from the list of available sources from data.gov. (If, for some reason, you wish to use a data source not on this page, please clear it with me first). The data source you choose must have either a CSV or JSON format available.

(Hint:  if you don’t want to spend a lot of time going through data files looking for a good one, you’re welcome to use the file 2010-census-by-zip-simplified.csv on T-Square in the resources folder.  It’s a version of the 2010 census data linked off of data.gov, but with spaces removed from the field names to make the assignment a bit simpler.

Another hint: working with CVS data files will generally be a bit easier than working with JSON data files, but you can pick whichever you’d like.)

You can think of these data files as representing tabular data, encoded into either CSV or JSON format. The table consists of multiple fields, which you can think of as columns. Each entry expresses the relations among a set of fields, which you can think of as being the rows in the table. While JSON data is potentially a bit more complex than this conceptually, most of the data sets at data.gov still basically represent tabular data. You must pick a data source that has at least four fields in it and at least 10 rows.

Your code must load the data you’ve chosen, and then support a variety of queries that can be typed in at the command line by a user. You can hard-code the reference to your data file, but please set up your code so that it expects the data file to be in the same folder as your Python code (and so references it using just the base filename), and doesn’t contain any platform-specific path characters. I should be able to copy your code and data file to my computer and run it without problem. For reading user input from the command line, you’ll want to use raw_input() or one of its variants.

The queries you should support are:

  • max <field>: given a particular field name, find and print the data that has the maximum value for this field across your entire data set.
  • min <field>: given a particular field name, find and print the data that has the minimum value for this field.
  • avg <field>: given a field name, compute the average of all values in your data for this field. (You can assume this will only be called on fields that have numeric values.)
  • search <field> <value>: find all data in your dataset for which the named field has the indicated value.
  • range <field> <min> <max>: find all data in your dataset for which the named field has values between the specified min and max, and display in ascending order.
  • quit: exit the program.

In essence, these commands make up a simple command language that will allow your user to explore the data. When these commands are entered into your program, your code should parse them and execute them correctly.

(Hint: some of the data will have field names that contain spaces, such as “Total Population.” This will make parsing the commands a bit trickier. For this assignment,  if you don’t want to code your program to deal with spaces in your field names, you can either choose a data set that doesn’t have spaces in the field names, or simply edit the data to remove the spaces.)

Here’s an example interaction, for a table of population estimates from 2013:

STATE        POP_ESTIMATE
Alabama      4833722
Alaska       735132
Arizona      6626624
Arkansas     2959373

> min POP_ESTIMATE
735132
> max POP_ESTIMATE
6626624
> avg POP_ESTIMATE
3788713
> search STATE Alaska
STATE          POP_ESTIMATE
Alaska         735132
> range POP_ESTIMATE 2000000 5000000
STATE          POP_ESTIMATE
Arkansas       2959373
Alabama        4833722

Note that if your data doesn’t lend itself well to these queries, please see me… we can come up with an alternative set that makes more sense for your data.

To submit your assignment, please create and submit on T-Square a ZIP file that contains the following:

  • Your python program
  • The data file you’re using (if this file is larger than a megabyte, please trim it down before submitting… be sure to make sure your program works on this smaller data file before submitting it)
  • A short README.txt file that contains examples of the above queries that produce results using your data file.
EXTRA CREDIT:

Up to 10 points for additional sorts of queries, depending on complexity and usefulness.