Homework 3: Analyzing Data from the Web
Due date: See class schedule
In this assignment you’ll gain experience in writing larger programs, and in creating software that interacts with live web data. We will write a program that uses the Twitter API (application programming interface) to pull down a live stream of tweets. You will then use the text analysis methods that you learned in Homework 1 to analyze these tweets for sentiment and objectivity.
Background and Resources
Before we get started, some resources and pointers to additional information:
Python Twitter Tools – a handy “wrapper” around the raw Twitter API.
A very useful tutorial for the above Python Twitter Tools streaming and search APIs.
Overview of the raw Twitter streaming APIs – these will be less useful than the Python Twitter Tools docs and the tutorial above, but are given here for reference.
A somewhat out-of-date (but still very helpful) “map” of the data that’s returned from the API about each tweet. The figure below shows the raw JSON-formatted data that’s returned about each tweet.
Background Information:
There are a number of different Python packages for working with the Twitter APIs. The one we’ll be using for this assignment is called Python Twitter Tools. This package gives you access to two different sets of APIs: the streaming API (which gives you a sample of tweets in real time as they are published on Twitter), and the search API (also called the REST API, which is good for one-off searches, getting user profile information, or posting tweets). We’ll be using the streaming API for this project.
To set up and install this package you’ll need to do the following:
pip install twitter
(If you get errors during the install, you may need to run this command with Administrator privileges. On the Mac, run sudo pip install twitter. You’ll need to enter your password.)
Next, you’ll need to get your API key for Twitter. This is a set of credentials—called OAuth credentials, named after the authentication method that Twitter uses—that your program will need in order to connect to the Twitter web service (so that Twitter knows the user on behalf of whom the program is running, and can flag and disable programs that are exhibiting malicious behavior). There are multiple steps to this process. These are outlined in the tutorial above, but for reference:
1. First, create a Twitter user account if you do not already have one.
2. Go to https://apps.twitter.com and log in with your Twitter user account. This step gives you a Twitter developer account under the same name as your user account.
3. Click “Create New App”
4. Fill out the form, agree to the terms, and click “Create your Twitter Application.” You might call the application something like CS6452Homework3, but you can give it any name you want.
5. In the next page, click on the “Keys and Access Tokens” tab. You’ll need to save copies of the information you’re about to get from the Twitter site; copying and pasting them into Sublime Text or another plain-text editor is perfect for this. Copy your “API key” and “API secret” (also called “Consumer Key” and “Consumer Secret”) and save them. Scroll down and click “Create my access token” and then copy your “Access token” and “Access token secret” to the same file as the other credentials.
When you copy and paste this information, be sure to get the names of the credentials along with the credentials themselves. The file you create should look like this:
ACCESS_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ACCESS_SECRET="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CONSUMER_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CONSUMER_SECRET="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
6. Save this file someplace where you’ll remember it. You can give it any name you want; something like mytwitterauthdata.txt would work. You’ll need to use this credential information in your program when it connects to Twitter, as explained in the assignment details.
Finally, The Assignment Description…
Your goal in this assignment is to write a program that connects to Twitter using the streaming API, and then collects and analyzes tweets for a search term that you prompt the user for. For each tweet that is pushed to you from Twitter, you should display the tweet itself, the Twitter handle and user name of the tweeter, and any hashtags included in the tweet. Also, for each tweet, you’ll display an objectivity and sentiment analysis of the tweet based on the TextBlob work that you did in the first homework.
For example, a single tweet might be displayed like this:
RT @perfectsliders: #Trump said "I was saddened to see how bad the ratings were on the Emmys last night, worst ever. Smartest people of the…
Sent by user @PatriotMimiC (Mimi USA)
#Trump
Tweet sentiment is STRONGLY NEGATIVE (-0.566666666667)
Tweet subjectivity is LIGHTLY SUBJECTIVE (0.577777777778)
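As a rough sketch of how a display like this could be assembled, the code below pulls the handle, user name, and hashtags out of one tweet and formats them with the two scores. It assumes each tweet arrives as a dictionary with the "text", "user" ("screen_name", "name"), and "entities" ("hashtags") fields found in Twitter’s tweet JSON. The function names, the label cutoffs, and the sample tweet are all made up for illustration; the polarity and subjectivity values would come from your TextBlob code from Homework 1.

```python
def polarity_label(polarity):
    """Turn a polarity score in [-1, 1] into a label.

    The cutoffs are illustrative assumptions, not values set by the assignment.
    """
    if polarity <= -0.5:
        return "STRONGLY NEGATIVE"
    if polarity < 0:
        return "LIGHTLY NEGATIVE"
    if polarity < 0.5:
        return "LIGHTLY POSITIVE"
    return "STRONGLY POSITIVE"


def subjectivity_label(subjectivity):
    """Turn a subjectivity score in [0, 1] into a label (cutoffs are assumptions)."""
    if subjectivity < 0.25:
        return "STRONGLY OBJECTIVE"
    if subjectivity < 0.5:
        return "LIGHTLY OBJECTIVE"
    if subjectivity < 0.75:
        return "LIGHTLY SUBJECTIVE"
    return "STRONGLY SUBJECTIVE"


def describe_tweet(tweet, polarity, subjectivity):
    """Build the multi-line display text for one tweet dictionary."""
    user = tweet.get("user", {})
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    # Each hashtag entry is a dictionary whose "text" field omits the leading '#'.
    tags = " ".join("#" + tag["text"] for tag in hashtags)

    lines = [
        tweet.get("text", ""),
        "Sent by user @{} ({})".format(user.get("screen_name", "?"), user.get("name", "?")),
    ]
    if tags:
        lines.append(tags)
    lines.append("Tweet sentiment is {} ({})".format(polarity_label(polarity), polarity))
    lines.append("Tweet subjectivity is {} ({})".format(subjectivity_label(subjectivity), subjectivity))
    return "\n".join(lines)


# A hand-built tweet (not real data) with only the fields used above.
sample_tweet = {
    "text": "Loving the new #Python release",
    "user": {"screen_name": "example_user", "name": "Example User"},
    "entities": {"hashtags": [{"text": "Python"}]},
}
print(describe_tweet(sample_tweet, 0.6, 0.7))
```

Using .get() with defaults keeps the program from crashing if a message is missing one of these fields.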
Your program should do this for at least 10 tweets, and keep a running tally of overall sentiment and objectivity. At the end, this summative data should be displayed by your program. For example, you might display it like this:
Overall analysis of 10 tweets:
3 were positive (30.0%). 7 were negative (70.0%). Average sentiment value was -0.0928385416667
6 were subjective (60.0%). 4 were objective (40.0%). Average subjectivity value was 0.394444444444
In order to compute this summative data, each time you analyze an individual tweet, you’ll want to keep track of whether it’s positive or negative sentiment, and whether it’s subjective or objective; you might keep counters for each category of tweet to do this. To compute the averages, you’ll want to keep a running total of the sentiment and objectivity values and divide by the number of tweets.
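The bookkeeping described above might look something like the following sketch. The class name Tally and its methods are hypothetical, and so are the rules for sorting tweets into categories: here a polarity of 0 or more counts as positive and a subjectivity of 0.5 or more counts as subjective, which is one reasonable choice rather than a requirement. The scores in the demo are made-up numbers.

```python
class Tally:
    """Running counters and totals for the end-of-run summary."""

    def __init__(self):
        self.count = 0
        self.positive = 0
        self.negative = 0
        self.subjective = 0
        self.objective = 0
        self.polarity_total = 0.0
        self.subjectivity_total = 0.0

    def add(self, polarity, subjectivity):
        """Record the scores for one analyzed tweet."""
        self.count += 1
        self.polarity_total += polarity
        self.subjectivity_total += subjectivity
        # Assumption: zero polarity is counted as positive.
        if polarity >= 0:
            self.positive += 1
        else:
            self.negative += 1
        # Assumption: 0.5 is the dividing line between objective and subjective.
        if subjectivity >= 0.5:
            self.subjective += 1
        else:
            self.objective += 1

    def average_polarity(self):
        return self.polarity_total / self.count

    def average_subjectivity(self):
        return self.subjectivity_total / self.count

    def report(self):
        """Return the summary text, guarding against division by zero."""
        if self.count == 0:
            return "No tweets were analyzed."

        def pct(n):
            return "{:.1f}%".format(100.0 * n / self.count)

        return "\n".join([
            "Overall analysis of {} tweets:".format(self.count),
            "{} were positive ({}). {} were negative ({}). Average sentiment value was {}".format(
                self.positive, pct(self.positive),
                self.negative, pct(self.negative),
                self.average_polarity()),
            "{} were subjective ({}). {} were objective ({}). Average subjectivity value was {}".format(
                self.subjective, pct(self.subjective),
                self.objective, pct(self.objective),
                self.average_subjectivity()),
        ])


# Made-up (polarity, subjectivity) scores standing in for four analyzed tweets.
tally = Tally()
for polarity, subjectivity in [(0.5, 0.75), (-0.25, 0.25), (-0.5, 0.5), (0.25, 0.0)]:
    tally.add(polarity, subjectivity)
print(tally.report())
```

Calling add() once per tweet inside your main loop keeps the summary logic separate from the display logic.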
(NOTE: sentiment and objectivity analysis works best when there’s a lot of text to work with. With short tweets, the results will be noisy, but still should give an overall sense of the data.)
A Word on Credentials…
When you write and test your program you’ll need to use your own credential information in order to connect. The problem is that Twitter discourages sharing of these authentication credentials (obviously), because if someone has your credentials, Twitter will think any program they’re running (potentially a malicious one) is being run on your behalf. So you don’t want to “hard code” your credential information in your program when you turn it in, because then you’re sharing it with me and the TA.
To work around this problem, I’ve written a bit of code called oauthfile.py, which is available on T-Square in the resources folder. This file contains a function called readOAuthFile() that reads a file containing OAuth credentials in the format above. You’ll call this function from within your homework code, passing it the pathname of your OAuth file. The idea is that when you’re writing and testing your program, you use readOAuthFile() to load your credentials from the file you’ve stored them in on your computer. Then, when we test your program, we’ll use the same function to load our credentials from a different file on our computers. This way you run your code with your credentials, and we run your code with ours.
You can just copy and paste the contents of oauthfile.py into your code. To use the readOAuthFile() function, you simply pass it the pathname where your credentials are stored. It will attempt to read this file. If it succeeds, it will return a 4-item tuple containing:
(access_token, access_secret, consumer_key, consumer_secret)
which you can then use to authenticate to Twitter.
If the function fails—because either the filename you provide to it doesn’t exist, or isn’t readable, or the data is in the incorrect format—then it will return None. You should check this return value and exit your program in this case (and fix the problem) because you won’t be able to authenticate to Twitter without these values.
How We’ll Run Your Program
When we test your program, we will expect to be able to run it by passing it the location of the OAuth file we’ll be using on the command line. For example:
python YourHomework3.py /Users/keith/oauthdata.txt
So your program should use the sys module and the sys.argv variable to get the command line argument and pass that to readOAuthFile() to get the authentication data. When you run your program, you’ll use the path to your OAuth file, and when we run it we’ll use the path to our OAuth file.
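One possible shape for this startup step is sketched below, assuming Python 3. The helper name load_credentials is hypothetical, and the reader function is passed in as a parameter so that the sketch can run on its own with a stand-in; in your program you would pass the real readOAuthFile from oauthfile.py instead.

```python
import sys


def load_credentials(argv, read_oauth_file):
    """Return the 4-item credential tuple, or None if it cannot be loaded.

    argv is the command line (normally sys.argv) and read_oauth_file is the
    function that reads the file (normally readOAuthFile from oauthfile.py).
    """
    # argv[0] is the script name, so the OAuth file path is argv[1].
    if len(argv) < 2:
        print("usage: python YourHomework3.py <path-to-oauth-file>", file=sys.stderr)
        return None

    credentials = read_oauth_file(argv[1])
    if credentials is None:
        print("Could not read OAuth credentials from " + argv[1], file=sys.stderr)
        return None
    return credentials


def fake_read_oauth_file(path):
    """Stand-in for readOAuthFile() with made-up values, for illustration only."""
    if path == "good.txt":
        return ("token", "token-secret", "key", "key-secret")
    return None


print(load_credentials(["YourHomework3.py", "good.txt"], fake_read_oauth_file))
```

In the real program you would call load_credentials(sys.argv, readOAuthFile) and call sys.exit(1) if it returns None, since there is no way to authenticate without the credentials.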
When your program runs, it should prompt the user for a search term. When the user enters the search term, the program should take that string and build a query to Twitter and display the results. Your program can then loop back to prompt the user again, if you choose.
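To make the flow concrete, here is a minimal sketch of the query step. It assumes the OAuth and TwitterStream classes and the statuses.filter(track=...) call from Python Twitter Tools; check the tutorial and package docs for the details. The function names open_tweet_stream and take_tweets are hypothetical, as is the rule of skipping any message without a "text" field (the stream can deliver messages that are not tweets). The demo at the bottom uses hand-built messages, so it runs without a network connection.

```python
def open_tweet_stream(credentials, search_term):
    """Connect to the streaming API and return an iterator of messages.

    Needs the twitter package, valid credentials, and network access.
    """
    # Imported here so the rest of the sketch runs without the package installed.
    from twitter import OAuth, TwitterStream

    access_token, access_secret, consumer_key, consumer_secret = credentials
    auth = OAuth(access_token, access_secret, consumer_key, consumer_secret)
    stream = TwitterStream(auth=auth)
    # "track" asks Twitter for tweets that match the search term.
    return stream.statuses.filter(track=search_term)


def take_tweets(messages, limit=10):
    """Yield up to `limit` tweets from `messages`, skipping non-tweet messages."""
    if limit <= 0:
        return
    seen = 0
    for message in messages:
        # Assumption: anything that is not a dictionary with "text" is not a tweet.
        if not isinstance(message, dict) or "text" not in message:
            continue
        yield message
        seen += 1
        if seen >= limit:
            break


# Hand-built messages (not real data) standing in for a live stream.
fake_messages = [
    {"delete": {}},
    {"text": "first tweet"},
    {"text": "second tweet"},
    {"text": "third tweet"},
]
for tweet in take_tweets(iter(fake_messages), limit=2):
    print(tweet["text"])
```

In the real program, you would read the search term with input(), pass it to open_tweet_stream() along with your credentials, and loop over take_tweets() to analyze and display each tweet. Stopping after a fixed number of tweets matters because the live stream does not end on its own.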