Jonathan Lee Rossi

Exploring Baseball Fan Loyalty

August 25, 2016    |    Data Science

Data was collected using both the Twitter Search API and the Twitter Streaming API. The data spans ten days for the Yankees and eleven days for the Red Sox.
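The streaming script appears in full at the end of this post. As a rough sketch of the Search API side, using the same "twitter" library; the query term and count here are illustrative assumptions, not the exact collection parameters:

import json
from twitter import Twitter, OAuth

# Same credential placeholders as the streaming script below
oauth = OAuth('YOUR ACCESS TOKEN', 'YOUR ACCESS TOKEN SECRET',
              'YOUR API KEY', 'YOUR API SECRET')
t = Twitter(auth=oauth)

# Fetch recent tweets mentioning the Red Sox
results = t.search.tweets(q='redsox', count=100)
for tweet in results['statuses']:
    print(json.dumps(tweet))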

Questions to consider: Do we expect a lot of positive tweets near Boston on days when the Red Sox win? Do we expect negative tweets in other regions of the US? What about on days when they lose or don't play?
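As a minimal first cut at these questions, assume the scored tweets sit in a pandas DataFrame with the 'sentiment', 'outcome', and 'hav_distance' columns that appear in the SQL query later in this post; the DataFrame, function name, and 50-mile cutoff are hypothetical stand-ins:

import pandas as pd

def sentiment_by_outcome(df, near_miles=50):
    # Label each tweet as near the home city or elsewhere, using the
    # haversine distance already attached to each row
    region = df['hav_distance'].le(near_miles).map(
        {True: 'near Boston', False: 'elsewhere'})
    # Mean sentiment for each (game outcome, region) pair
    return (df.assign(region=region)
              .groupby(['outcome', 'region'])['sentiment']
              .mean()
              .unstack())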

There are significant limitations in our data. Tweets were collected at a random time each day, which may have been before, during, or after that day's game, and only a fixed number of tweets (1,000) was collected per day.

Red Sox

Though we would need to alter our data set to see this, another interesting question is whether there is a lag in sentiment: if the Red Sox lose today, do we see a spike in negative sentiment tomorrow? We would want to query for all the data points that occur the day after a win or a loss, rather than on the win or loss day itself.
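A minimal sketch of that reshaping, assuming the tweets live in a pandas DataFrame with the 'dates', 'outcome', and 'sentiment' columns used in the SQL query below (the DataFrame and helper name are hypothetical stand-ins):

import pandas as pd

def add_prev_day_outcome(df):
    # Attach the *previous* day's game outcome to each tweet so that
    # sentiment can be analyzed one day after a win or a loss
    df = df.copy()
    df['dates'] = pd.to_datetime(df['dates'])
    # Collapse to one outcome per calendar day
    daily = df.groupby(df['dates'].dt.normalize())['outcome'].first()
    # Re-date each outcome to the following day, then map it back on
    prev = daily.shift(1, freq='D')
    df['prev_outcome'] = df['dates'].dt.normalize().map(prev)
    return df

# Negative-sentiment tweets the day after a loss:
# after = add_prev_day_outcome(df)
# after[(after['prev_outcome'] == 'L') & (after['sentiment'] < 0)]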

$ python twitter_search.py > redsox_search_08_16.txt

Yankees

The following SQL query (written for CARTO, hence the cartodb_id and the_geom_webmercator columns) produces the 'coin-stacking' style of data points if desired: points sharing an exact location are offset vertically so that overlapping tweets remain visible.


-- m: group tweets sharing an exact location, collecting their ids into a stack
WITH m AS (
  SELECT array_agg(cartodb_id) id_list, the_geom_webmercator,
         ST_Y(the_geom_webmercator) y
  FROM df_yankees
  GROUP BY the_geom_webmercator
  ORDER BY y DESC
-- f: unnest each stack, numbering the points within it
), f AS (
  SELECT generate_series(1, array_length(id_list, 1)) p,
         unnest(id_list) cartodb_id,
         the_geom_webmercator
  FROM m
)
-- shift each point north by 50 web-mercator units per position in its stack
SELECT ST_Translate(f.the_geom_webmercator, 0, f.p * 50) the_geom_webmercator,
       f.cartodb_id, q.text, q.retweet_count, q.dates, q.sentiment,
       q.hav_distance, q.outcome
FROM f, df_yankees q
WHERE f.cartodb_id = q.cartodb_id

To gain more insight, we will need much better data collection methods and more historical data. For reference, here is the collection script:

# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

# Import the necessary methods from the "twitter" library
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

# Variables that contain the user credentials to access the Twitter API
ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
CONSUMER_KEY = 'YOUR API KEY'
CONSUMER_SECRET = 'YOUR API SECRET'

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

# Initiate the connection to the Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)

# Get a sample of the public data flowing through Twitter
iterator = twitter_stream.statuses.sample()

# Print each tweet in the stream to the screen.
# Here we set it to stop after collecting 1000 tweets.
# You don't have to set a stopping point; you can keep the
# stream running to collect data for days or even longer.
tweet_count = 1000
for tweet in iterator:
    tweet_count -= 1
    # The Twitter Python library wraps the data returned by Twitter
    # as a TwitterDictResponse object.
    # We convert it back to JSON format to print/score
    print(json.dumps(tweet))

    # The line below does pretty printing of the JSON data; try it out
    # print(json.dumps(tweet, indent=4))
       
    if tweet_count <= 0:
        break
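The script above uses the sample endpoint, which returns a random slice of all public tweets. To target team-related tweets directly, the same library exposes the Streaming API's filter endpoint; a sketch, with illustrative track terms:

import json
from twitter import OAuth, TwitterStream

oauth = OAuth('YOUR ACCESS TOKEN', 'YOUR ACCESS TOKEN SECRET',
              'YOUR API KEY', 'YOUR API SECRET')
twitter_stream = TwitterStream(auth=oauth)

# Only tweets containing these terms are delivered
iterator = twitter_stream.statuses.filter(track='redsox,yankees')

for tweet in iterator:
    print(json.dumps(tweet))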