
Abstract

Task


The task I am exploring is modeling and predicting the scoring trends of NBA players; more specifically, I intend to predict the number of points scored by various players during games of the 2018 NBA playoffs. To achieve this, I aim to determine whether any correlation exists between player performance and social media content, primarily on Twitter. This topic is inherently interesting because if a connection can be found between social media activity and actual game performance, it could lead to better predictions of player production, scoring trends, and overall game results, as well as foster more effective coaching strategies and scouting.


Approach


To approach this problem, I extracted Twitter data using a custom script that repeatedly searched for tweets matching a specific set of parameters. I selected several high-performing players from each team and searched for tweets containing each player's name on each day that the player's team played a game. For each player and game, I recorded the number of tweets found with favorites (likes), the total number of favorites, and the ratio of favorites per tweet. Repeating this across all 37 games of the three final rounds of the playoffs yielded 238 data points. Along with the Twitter attributes, I added other attributes to each data point to explore whether they would improve the final model's accuracy: the player's team, whether the game fell on a weekday or a weekend, whether the game was home or away, the player's season scoring average, and the outcome of the game. Using Weka, I tested models trained only on the Twitter attributes as well as on all attributes. I also tested classifiers generating continuous predictions, along with two types of categorical outputs representing ranges of points scored. In each iteration, I ran every classifier in Weka and recorded the results of several notable models, as well as the model with the best performance. Performance was measured with 10-fold cross-validation, using the correlation coefficient as the accuracy indicator for continuous predictions and simple classification accuracy for the categorical predictions. These results can be found in Figure 1 below.

Figure 1: Model Performance Over Iterations of Classifier Testing
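The per-player, per-game feature extraction described above can be sketched as follows. This is a minimal illustration, not the original script: the tweet records, field names, and the reading of "favorites per tweet" as total favorites over tweets with favorites are all assumptions.

```python
# Hypothetical sketch of the Twitter feature extraction: for one player on
# one game day, compute the three recorded attributes from a list of tweets.
# Tweet records here are plain dicts with an assumed "favorites" field.

def twitter_features(tweets):
    """Return (tweets_with_favorites, total_favorites, favorites_per_tweet).

    The ratio is computed over tweets that received at least one favorite,
    which is one plausible reading of "favorites per tweet".
    """
    favorited = [t for t in tweets if t["favorites"] > 0]
    total_favorites = sum(t["favorites"] for t in favorited)
    ratio = total_favorites / len(favorited) if favorited else 0.0
    return len(favorited), total_favorites, ratio

# Example: four tweets mentioning a player on a game day.
sample = [{"favorites": 12}, {"favorites": 0}, {"favorites": 3}, {"favorites": 5}]
print(twitter_features(sample))  # prints (3, 20, 6.666666666666667)
```

Each such feature tuple, combined with the non-Twitter attributes, forms one of the 238 data points fed to Weka.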

Results


The top performance of models using only the Twitter attributes is very similar to that of models taking all attributes into account. In the classification-output trials, the models using only Twitter data even slightly outperformed the models using all attributes, showing that the Twitter attributes are significant factors in the classifier's predictions. Overall, the models perform decently but leave room for improvement; given the inherent difficulty of predicting player performance in sporting events, this is understandable. In my opinion, the classifiers producing continuous score predictions perform better and are more useful for visualizing predicted scores. A correlation coefficient of around 0.7 constitutes a relatively accurate predictor for a task as difficult as predicting player scoring in basketball games. The top-performing continuous classifiers using both sets of attributes also had relatively low error, averaging around 5 points per prediction.
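The two metrics behind these results can be computed directly. Below is a minimal sketch of the Pearson correlation coefficient and mean absolute error on a small set of illustrative predicted-versus-actual point totals; the numbers are made up for the example and are not project data.

```python
# Sketch of the evaluation metrics: Pearson correlation for continuous
# predictions, and mean absolute error ("average error in points").
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_abs_error(pred, actual):
    """Average absolute difference between predicted and actual points."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

actual = [31, 18, 24, 40, 12, 27]  # points actually scored (illustrative)
pred   = [27, 22, 25, 33, 16, 30]  # model predictions (illustrative)
print(round(pearson(pred, actual), 3))
print(round(mean_abs_error(pred, actual), 2))
```

Weka reports both quantities for regression-style classifiers, which is how the correlation of roughly 0.7 and average error of about 5 points above were obtained.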
