
Final Report

The task I am exploring is modeling and predicting the scoring trends of NBA players; more specifically, I intend to predict the number of points scored by various players during games of the 2018 NBA playoffs.  To achieve this, I aim to determine whether any correlation can be found between player performance and social media content, primarily on Twitter.  Social media becomes extremely active during important sports contests, as fans, experts, and even players post about games before, during, and after they take place.  Twitter is an ideal place to analyze social media data, as it primarily consists of short, text-based interactions, is easily searchable, and has an enormous user base across the world.  Tweets are quick and easy to send, and even more easily read, shared, and liked.  By analyzing the prevalence and popularity of tweets about NBA players, perhaps their in-game performance can be predicted as well.


This topic is inherently interesting because if any connection can be found between social media activity and actual game performance, it could lead to better predictions of player production, scoring trends, and overall game results, as well as more effective coaching strategies and scouting.  Any data analysis that can predict outcomes in the ever-erratic field of sports, let alone the individual performances of players within those games, would have enormous potential for many applications.  Additionally, with the rise of social media over the last decade, people have become increasingly reliant on it for communication, discussion, and information.  Using such a widespread platform as a source for data analysis and machine learning is a promising possibility, and sports could be just the first of its many potential applications.


To approach this problem, data acquisition was first required.  To extract Twitter data, I wrote a custom script using the Tweepy library (a Python wrapper for the Twitter API) to repeatedly search for tweets matching a given set of parameters.  For each team in the Conference Semifinals, Conference Finals, and NBA Finals, I selected several high-performing players (three per team for the Conference Semifinals and Conference Finals, and five per team for the NBA Finals) and searched for tweets containing each player’s name on each day that the team played a game.  After searching up to 5,000 tweets per player per game (a cap dictated by time constraints and Twitter’s rate limits on search queries), I recorded the number of tweets found that had favorites (likes), the total number of favorites, and the ratio of favorites per tweet.  Repeating this across all 37 games of the three final rounds of the playoffs resulted in a total of 238 data points.
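The original script is not reproduced here, but a minimal sketch of this collection loop might look like the following.  It assumes Tweepy v4’s `api.search_tweets` endpoint (older versions name it `api.search`), placeholder credentials, and one plausible reading of the favorites-per-tweet ratio; the exact query parameters of the real script are not documented in this report.

```python
import tweepy

# Placeholder credentials; the real script's keys are of course not shown.
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # sleep through rate limits

def collect_tweet_stats(player_name, until_date, max_tweets=5000):
    """Search up to max_tweets tweets mentioning a player and tally favorites.

    `until_date` is a "YYYY-MM-DD" string; the standard search API's `until`
    parameter returns tweets created before that date (restricting results to
    the game day itself would need an additional `since:` operator in the query).
    """
    tweets_with_favorites = 0
    total_favorites = 0
    for tweet in tweepy.Cursor(api.search_tweets, q=player_name,
                               until=until_date, count=100).items(max_tweets):
        if tweet.favorite_count > 0:
            tweets_with_favorites += 1
        total_favorites += tweet.favorite_count
    # Favorites per favorited tweet, one plausible reading of the ratio above.
    ratio = total_favorites / tweets_with_favorites if tweets_with_favorites else 0.0
    return tweets_with_favorites, total_favorites, ratio
```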


Along with the attributes from Twitter, other attributes were added to each data point to explore whether they would improve the final model’s accuracy.  These attributes included the player’s team, whether the game was played on a weekday or a weekend, whether the game was home or away, and the player’s scoring average for the season.  Since the tweets gathered had no guarantee of having been posted before the start of the game (only the day of a tweet could be specified, not the time), the outcome of the game (win/loss) was also included as an attribute to see whether it would help the model.  However, if the model were to be used to predict player scoring before a game actually started, this attribute would have to be eliminated.
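As an illustration of how one such data point might be assembled, here is a hypothetical sketch; the field names and schema are my own, not necessarily those of the actual dataset.

```python
from datetime import date

def build_example(twitter_stats, team, game_date, is_home, season_ppg,
                  won_game, points_scored):
    """Assemble one labeled example from the attributes described above."""
    tweets_with_favorites, total_favorites, favorites_per_tweet = twitter_stats
    return {
        "tweets_with_favorites": tweets_with_favorites,
        "total_favorites": total_favorites,
        "favorites_per_tweet": favorites_per_tweet,
        "team": team,
        # Weekday vs. weekend, derived from the game date.
        "weekend": game_date.weekday() >= 5,
        "home": is_home,
        "season_ppg": season_ppg,
        # Known only after the game; dropped for true pre-game prediction.
        "win": won_game,
        "points": points_scored,  # the value the model learns to predict
    }

# LeBron James, Game 1 of the 2018 Finals (Twitter figures invented):
example = build_example((1200, 54000, 45.0), "CLE", date(2018, 5, 31),
                        False, 27.5, False, 51)
```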


After collecting the data, I compiled all examples together for modeling.  I ran through several iterations of modeling, using different subsets of the data, to explore how various classifiers would respond.  I first tested classifiers with only the attributes from Twitter, and then tested classifiers using all of the attributes for comparison.  For each of these tests, I also used several ways of representing the output of the classifiers.  First, I modeled the data continuously, with the classifier generating a specific numerical value representing the predicted number of points scored by that player in that game.  Then, I converted the continuous output into classifications representing ranges of five points.  Each range was assigned a letter: if a player scored 0-4 points, they were assigned “A”; if they scored 5-9 points, they were assigned “B”; and so on.  Finally, I repeated this with ranges of ten points (where “A” = 0-9 points, “B” = 10-19 points, etc.).  Altogether, this resulted in six iterations of modeling, each providing different results.
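The discretization itself is straightforward; a small sketch of the letter-class conversion used in both binned iterations:

```python
import string

def points_to_class(points, bin_size):
    """Map a point total to a letter class: with bin_size=5,
    0-4 points -> "A", 5-9 -> "B", and so on."""
    return string.ascii_uppercase[points // bin_size]

assert points_to_class(7, 5) == "B"    # 5-9 point range
assert points_to_class(23, 10) == "C"  # 20-29 point range
```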


During the actual modeling process, I tested each classifier in Weka using 10-fold cross-validation, the results of which I used as an indicator of the overall accuracy of the model.  For each of the six modeling iterations described above, I tried every applicable classifier in Weka, recording the results of notable classifiers in the table below.  If the most accurate classifier was not among the notable ones, I recorded its accuracy in the table as well.  For trials in which the output of the model was continuous, I used the correlation coefficient as the indicator of performance, with values close to 1 being more accurate.  For trials in which the output was a classification into a range of points scored, accuracy was measured by the percentage of correct classifications during testing.  See Figure 1 below for the full results of the modeling process.
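The modeling itself was done entirely in Weka’s workbench; for readers working in Python, a roughly equivalent 10-fold cross-validation could be run with scikit-learn (a substitute toolchain, not the one used in this report), along the lines of:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data standing in for the 238 collected examples
# (three Twitter attributes per row here; the real dataset had more).
rng = np.random.default_rng(0)
X = rng.random((238, 3))
y = rng.integers(0, 45, size=238).astype(float)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# 10-fold cross-validation; scikit-learn's R^2 metric differs from Weka's
# correlation coefficient, but plays the same role as a CV performance score.
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(f"Mean 10-fold CV score: {scores.mean():.3f}")
```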

Figure 1: Model Performance Over Iterations of Classifier Testing

As shown in Figure 1 above, the top performance of models using only attributes from Twitter data is very similar to that of models taking all attributes into account.  For the classification output trials, the models using only Twitter data even slightly outperformed the models using all attributes, showing that the attributes from the Twitter data are significant factors in the final output of the classifier.  In terms of overall performance, the models all perform decently but leave something to be desired.  The accuracy of the models predicting ranges of points is lower than I would like, as an accuracy below 50% is not especially helpful for actually predicting scoring.  However, given the inherent difficulty of predicting player performance in sporting events, this is understandable.


The classifiers producing continuous score predictions are, in my opinion, both better performing and more useful for visualizing predicted scores.  A correlation coefficient of around 0.7 constitutes a relatively accurate predictor for a task as difficult as predicting player scoring in basketball games.  The top-performing continuous classifiers using both sets of attributes had relatively low error, averaging around 5 points per prediction for each model.  The prediction output of these top-performing models is shown in Figures 2 and 4 below, with larger data points representing larger prediction error.  Both models also had satisfactory error distributions, with most predictions off by less than 5-7 points, as shown in Figures 3 and 5 below.  A short sketch of how these two measures are computed follows the figures.

Figure 2: Continuous Output Predictions Using All Attributes

Figure 3: Error Distribution for Continuous Output Predictions Using All Attributes

Figure 4: Continuous Output Predictions Using Only Attributes from Twitter

Figure 5: Error Distribution for Continuous Output Predictions Using Only Attributes from Twitter
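Both of the measures discussed above are simple to compute from a model’s predictions; a minimal sketch (an illustrative helper, not code from the project):

```python
import numpy as np

def evaluate_predictions(actual, predicted):
    """Pearson correlation coefficient and mean absolute error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(actual, predicted)[0, 1]
    mae = np.abs(actual - predicted).mean()
    return r, mae

# Toy example: predictions off by about 3 points still track scores closely.
r, mae = evaluate_predictions([12, 25, 31, 8], [15, 22, 28, 11])
print(f"correlation = {r:.2f}, mean absolute error = {mae:.1f} points")
```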

Overall, I believe these to be satisfactory results.  Given that this is merely an introductory investigation into the correlation between social media activity and sports performance, I did not expect my models to perform perfectly, and I am optimistic that with further investigation, this area could prove very valuable in predicting sports outcomes as well as player performance.


The first next step for future work would be to collect more data from Twitter.  Given more time, I would be interested to see how a model would perform if trained on an entire season’s worth of data.  I would also like to query and record information on more tweets, as I was forced to limit the number I viewed due to Twitter’s usage restrictions.  I would like to incorporate more variations on the content of tweets, rather than only tweets containing players’ exact names.  Finally, I would like to develop a way to treat the sources of tweets differently: ideally, a tweet from a verified sports analyst should hold more weight than a tweet from a fan with a few followers (a rough sketch of one possible weighting scheme is shown below).  Expanding the dataset and making it more robust in these ways could lead to more accurate models and better overall scoring predictions.
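As a purely hypothetical sketch of that last idea, since no source weighting was implemented here, one simple scheme might scale a tweet’s influence by audience size and verification status:

```python
import math

def tweet_weight(follower_count, is_verified):
    """One possible weighting scheme (an assumption, not an implemented feature):
    log-scale the follower count and boost verified accounts."""
    base = math.log10(follower_count + 1)
    return base * (2.0 if is_verified else 1.0)

print(tweet_weight(500_000, True))   # verified analyst, ~11.4
print(tweet_weight(80, False))       # casual fan, ~1.9
```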
