Toronto Blue Jays Attendance

Authors

Executive Summary

The work below was conducted in June 2017 as an extension of a course project in the Master of Management Analytics program at Queen’s University. The high-level goal of the project was to determine which factors contribute to ticket sales at Toronto Blue Jays home games. Specifically, we wanted to explore the following questions:

  1. To what degree does winning, both within a season and carried over from past years, increase attendance?
  2. What impact do particular types of promotions and giveaways have on attendance?
  3. What sort of attendance can be expected for the remainder of 2017 (July, August, September, October)?

Major findings include:

The Process

Data

Methodology

Feature Engineering

#Create the GIS counter (game number within a series) and set the first game to 1
final$GIS <- numeric(nrow(final))
final$GIS[1] <- 1

#Step through games. If the opponent or the year changes, reset GIS to 1; otherwise increment it.
for (i in 2:nrow(final)) {
  if (final$Team[i] != final$Team[i-1] || final$Year[i] != final$Year[i-1]) {
    final$GIS[i] <- 1
  } else {
    final$GIS[i] <- final$GIS[i-1] + 1
  }
}
rm(i)

#Engineer season opening games: flag the earliest game date in each year
library(dplyr)
Opener <- final %>%
  group_by(Year) %>%
  summarise(Date = min(Date))
Opener <- data.frame(Opener)
final$Opener <- as.numeric(final$Date %in% Opener$Date)
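
The model also uses the seasonal features sinT and cosT, whose construction is not shown here. A minimal sketch of one common way to build such features follows, assuming they encode days since the season opener as a point on a circle. The helper name `cyclical_encode` and the toy dates are hypothetical, and the 181-day period is an assumption chosen only because it is consistent with the first sinT and cosT values listed in the appendix.

```r
# Hypothetical helper: map days since the season opener onto a circle so that
# a model sees time-of-season as a smooth, periodic signal.
# `period` is an assumption; the source does not state the value used.
cyclical_encode <- function(dates, opener, period = 181) {
  days_elapsed <- as.numeric(as.Date(dates) - as.Date(opener))
  angle <- 2 * pi * days_elapsed / period
  data.frame(sinT = sin(angle), cosT = cos(angle))
}

# Toy dates for illustration only
game_dates <- as.Date(c("2004-04-05", "2004-04-06", "2004-04-07"))
enc <- cyclical_encode(game_dates, opener = game_dates[1])
print(round(enc, 4))
```

Using both the sine and the cosine keeps every day of the season distinguishable, which a single trigonometric term would not.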

The Business Case

Descriptive Analytics

#Create binary rivalry indicator for games against NYY or BOS
final <- final %>%
  mutate(BinRival = Team %in% c("BOS", "NYY"))

#Proximity indicator for opponents within driveable distance
final <- final %>%
  mutate(Proximity = Team %in% c("CLE", "CIN", "NYY", "CHC", "CWS", "DET", "PHI", "PIT"))

Modeling

Blackbox

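#Set Model Seeds
library(caret)
library(doParallel)

#caret expects one seed vector per resample (10 folds), each holding one seed per
#candidate tuning value (10), plus a single seed for the final model fit
set.seed(42)
seeds <- vector(mode = "list", length = 11)
for (i in 1:10) {seeds[[i]] <- sample.int(n = 100000, 10)}
seeds[[11]] <- sample.int(n = 100000, 1)
rm(i)

#Set Parallel Computing
cl <- makeCluster(2)
registerDoParallel(cl)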

#Set caret parameters
kfolds=10
search.length=10

cvCtrl <- trainControl(method="cv",
                       number=kfolds,
                       seeds=seeds)

forest <- train(Attendance~., 
                method="rf",
                trControl=cvCtrl,
                data=final_train,
                importance=TRUE,
                tuneLength=search.length)

#Shut down the cluster and return to sequential processing
stopCluster(cl)
remove(cl)
registerDoSEQ()

## rf variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##              Overall
## Year          100.00
## Salary         77.47
## cosT           41.96
## Game           40.98
## Series         37.98
## GB             37.57
## sinT           36.93
## BinaryDay      36.64
## PromoAmount    36.39
## BinRivalTRUE   35.19
## Playoffs       34.16
## DayN           33.24
## DayNum         32.55
## MinTemp        25.90
## Rank           25.74
## LateSeasonGB   24.87
## MeanTemp       23.55
## Opener         22.09
## MonthNum       20.11
## Weekend1       19.46

## [1] "The MAPE of the random forest is 14.72 %"

When is our Predictive (Blackbox) Model Wrong?

Whitebox Model (Stepwise OLS Regression)

## 
## Call:
## lm(formula = Attendance ~ sinT + cosT + Salary + Day + Playoffs + 
##     PromoAmount + Weekday + Rank + Opener + Series + GB + BinRival + 
##     Streak7 + Bag + Leafs + Raps + Holiday + Figurine + OtherWear + 
##     MinTemp + Hat + LateSeasonGB + GIS, data = step_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17831  -3924   -354   3641  22840 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.448e+04  2.206e+03  11.096  < 2e-16 ***
## sinT             -1.310e+03  5.073e+02  -2.582  0.01000 *  
## cosT             -2.523e+03  5.656e+02  -4.461 9.37e-06 ***
## Salary            4.962e+02  6.597e+01   7.522 1.53e-13 ***
## DayN             -5.996e+03  1.029e+03  -5.828 8.27e-09 ***
## Playoffs          9.793e+03  9.047e+02  10.825  < 2e-16 ***
## PromoAmount       2.585e-01  4.474e-02   5.776 1.11e-08 ***
## WeekdayTuesday    2.356e+03  9.095e+02   2.591  0.00976 ** 
## WeekdayWednesday  1.641e+03  9.955e+02   1.648  0.09970 .  
## WeekdayThursday   1.801e+03  1.090e+03   1.652  0.09892 .  
## WeekdayFriday     4.467e+03  9.039e+02   4.941 9.55e-07 ***
## WeekdaySaturday   5.728e+03  1.278e+03   4.481 8.56e-06 ***
## WeekdaySunday     1.734e+03  1.383e+03   1.254  0.21039    
## Rank             -6.828e+02  2.808e+02  -2.432  0.01526 *  
## Opener            1.608e+04  2.787e+03   5.771 1.15e-08 ***
## Series            5.153e+02  6.927e+01   7.439 2.75e-13 ***
## GB               -1.730e+02  8.921e+01  -1.939  0.05291 .  
## BinRivalTRUE      3.925e+03  8.190e+02   4.792 1.99e-06 ***
## Streak7           5.160e+03  1.820e+03   2.836  0.00469 ** 
## Bag              -2.885e+03  1.626e+03  -1.774  0.07643 .  
## Leafs             3.473e+03  1.363e+03   2.548  0.01104 *  
## Raps              2.332e+03  1.104e+03   2.112  0.03505 *  
## Holiday           3.357e+03  1.620e+03   2.072  0.03856 *  
## FigurineNone     -3.643e+03  1.315e+03  -2.771  0.00572 ** 
## OtherWear         4.748e+03  1.917e+03   2.477  0.01347 *  
## MinTemp          -1.242e+02  6.738e+01  -1.844  0.06561 .  
## Hat               3.868e+03  1.891e+03   2.046  0.04108 *  
## LateSeasonGB     -1.474e+02  7.962e+01  -1.851  0.06460 .  
## GIS              -6.079e+02  3.737e+02  -1.627  0.10417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6101 on 760 degrees of freedom
## Multiple R-squared:  0.6759, Adjusted R-squared:  0.664 
## F-statistic: 56.61 on 28 and 760 DF,  p-value: < 2.2e-16

## [1] "The MAPE of the OLS Regression is 18.49"

The residuals are approximately normal but include several large outliers: they range from -17,831 to 22,840 against a residual standard error of 6,101.
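
The code behind the stepwise fit and the residual check is not shown, so the following is only a sketch of how such a workflow might look in base R. The built-in `mtcars` data stands in for the attendance data, and backward selection by AIC is an assumption; the direction and criterion used for the model above are not stated.

```r
# mtcars is a stand-in dataset (an assumption for illustration only).
full_model <- lm(mpg ~ ., data = mtcars)

# Backward stepwise selection by AIC; trace = 0 suppresses the step-by-step log.
step_model <- step(full_model, direction = "backward", trace = 0)

# Standardized residuals: |value| > 3 is a common rule of thumb for outliers.
std_resid <- rstandard(step_model)
outliers <- which(abs(std_resid) > 3)

# Shapiro-Wilk test as a rough numeric companion to a normal Q-Q plot.
normality <- shapiro.test(residuals(step_model))

print(formula(step_model))
print(outliers)
print(normality$p.value)
```

A Q-Q plot (`qqnorm(std_resid); qqline(std_resid)`) gives the visual version of the same check.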

2017 Predictions

First 28 Games

             Stepwise OLS Regression   Random Forest
Model_MAPE                     14.10            9.74
Model_RMSE                   6355.81         5045.26
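
For reference, both error metrics in the comparison above can be computed in a few lines. This sketch assumes the usual definitions of MAPE (expressed in percent) and RMSE; the function names and the toy attendance figures are hypothetical and are not taken from the dataset.

```r
# Mean absolute percentage error, in percent.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Root mean squared error, in the same units as the response (attendees).
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values for illustration only
actual    <- c(40000, 30000, 20000)
predicted <- c(36000, 33000, 21000)

print(mape(actual, predicted))
print(rmse(actual, predicted))
```

MAPE weights every game equally in relative terms, while RMSE penalizes large misses on high-attendance games more heavily, which is why the two are reported side by side.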

Predictions for Remaining Games

#First, create one dataset per scenario for the remaining 53 games.
#Rank and GB define each scenario; temperatures are set to their training-set means.
BadRemain <- Remaining2017 %>%
  mutate(Rank = 5,
         GB = 12,
         MaxTemp = mean(final_train$MaxTemp),
         MinTemp = mean(final_train$MinTemp),
         MeanTemp = mean(final_train$MeanTemp),
         LateSeasonGB = ASO * GB)

MediumRemain <- Remaining2017 %>%
  mutate(Rank = 3,
         GB = 7,
         MaxTemp = mean(final_train$MaxTemp),
         MinTemp = mean(final_train$MinTemp),
         MeanTemp = mean(final_train$MeanTemp),
         LateSeasonGB = ASO * GB)

GoodRemain <- Remaining2017 %>%
  mutate(Rank = 1,
         GB = 2,
         MaxTemp = mean(final_train$MaxTemp),
         MinTemp = mean(final_train$MinTemp),
         MeanTemp = mean(final_train$MeanTemp),
         LateSeasonGB = ASO * GB)

#Build a named list of the scenarios and predict on each with the random forest.
scenarios <- list(Bad = BadRemain, Medium = MediumRemain, Good = GoodRemain)
preds_list <- lapply(scenarios, function(x) {predict(forest, newdata = x)})
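
The per-scenario predictions can then be assembled into a single table with one row per game, like the one in the appendix. The sketch below is self-contained, so it swaps in a toy linear model and a two-game schedule; `toy_model`, `train`, and `remaining` are hypothetical stand-ins for `forest`, `final_train`, and `Remaining2017`.

```r
# Toy stand-ins (assumptions): a tiny training set and a linear model
# used in place of the random forest.
train <- data.frame(Attendance = c(20000, 25000, 32000, 41000, 45000),
                    GB = c(12, 9, 6, 3, 1))
toy_model <- lm(Attendance ~ GB, data = train)

remaining <- data.frame(Date = as.Date(c("2017-07-01", "2017-07-02")))

# Named list of scenarios so the names carry through lapply().
scenarios <- list(Bad = transform(remaining, GB = 12),
                  Medium = transform(remaining, GB = 7),
                  Good = transform(remaining, GB = 2))
preds_list <- lapply(scenarios, function(x) predict(toy_model, newdata = x))

# One row per game, one column per scenario.
pred_cols <- as.data.frame(lapply(preds_list, function(p) round(unname(p))))
pred_table <- cbind(remaining["Date"], pred_cols)
print(pred_table)

# Total predicted attendance under each scenario.
print(colSums(pred_cols))
```

Naming the list elements means the scenario labels become the column names without any extra bookkeeping.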

Next Steps

If you’re interested in learning more about the work that was done, please contact David Murray or Gage Sonntag.

Appendix

Structure of Dataset Used

## 'data.frame':    1050 obs. of  49 variables:
##  $ Game           : int  1 2 3 10 11 12 13 14 15 26 ...
##  $ Attendance     : num  47817 21003 13100 14239 20177 ...
##  $ Leafs          : int  1 1 1 0 1 1 1 1 1 1 ...
##  $ Raps           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Rank           : num  4 5 5 5 5 5 5 5 5 5 ...
##  $ GB             : num  1 1.5 2 2.5 3.5 4.5 5.5 6 5.5 7.5 ...
##  $ Day            : Factor w/ 2 levels "D","N": 1 2 2 2 1 1 2 2 2 2 ...
##  $ Streak         : num  0 -1 -2 1 -1 -2 -3 -4 -5 -1 ...
##  $ BinaryDay      : num  1 0 0 0 1 1 0 0 0 0 ...
##  $ Year           : num  2004 2004 2004 2004 2004 ...
##  $ Month          : Factor w/ 7 levels "Apr","May","Jun",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ MaxTemp        : num  2.5 7.4 9.9 12.8 18.4 11.6 8.7 17.2 15.3 9.4 ...
##  $ MinTemp        : num  -6.2 -2.7 3.1 4.8 6.9 7.9 4.3 5.8 8.5 1.6 ...
##  $ MeanTemp       : num  -1.9 2.4 6.5 8.8 12.7 9.8 6.5 11.5 11.9 5.5 ...
##  $ Precipitation  : num  0 3.8 0 0 3.4 10.6 0 11.8 0 0 ...
##  $ Salary         : num  4.68 4.68 4.68 5.16 5.16 ...
##  $ Hat            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Jersey         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Shirt          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Figurine       : Factor w/ 2 levels "Bobblehead","None": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bag            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherWear      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NonWear        : num  1 1 0 0 1 0 0 0 0 0 ...
##  $ Kids           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PromoAmount    : num  50000 40000 0 0 10000 0 0 0 0 0 ...
##  $ KidsTheme      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AdultTheme     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GIS            : num  1 2 3 1 2 3 1 2 3 1 ...
##  $ Opener         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ Holiday        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Playoffs       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BinaryPrecip   : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ Weekday        : Factor w/ 7 levels "Monday","Tuesday",..: 1 2 3 5 6 7 2 3 4 1 ...
##  $ MonthNum       : num  4 4 4 4 4 4 4 4 4 5 ...
##  $ DayNum         : num  1 2 3 5 6 7 2 3 4 1 ...
##  $ sinT           : num  0 0.0347 0.0694 0.3726 0.4046 ...
##  $ cosT           : num  1 0.999 0.998 0.928 0.914 ...
##  $ Series         : num  1 1 1 2 2 2 3 3 3 4 ...
##  $ ASO            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Streak7        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Streak5        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Streak3        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ JunJul         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AprMay         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ LateSeasonGB   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Weekend        : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Weekday_Daytime: num  0 0 0 0 1 1 0 0 0 0 ...
##  $ BinRival       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Proximity      : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...

Predictions for Remainder of Year

Date        Opponent  Weekday    Bad    Medium  Good
2017-06-02  NYY       Friday     38363  38616   38974
2017-06-03  NYY       Saturday   43157  44085   44474
2017-06-04  NYY       Sunday     43015  44139   44432
2017-06-13  TBR       Tuesday    34994  35907   36527
2017-06-14  TBR       Wednesday  33873  34413   34982
2017-06-16  CHW       Friday     35610  36051   36720
2017-06-17  CHW       Saturday   42278  43524   44327
2017-06-18  CHW       Sunday     43023  44464   45095
2017-06-27  BAL       Tuesday    36250  36750   37713
2017-06-28  BAL       Wednesday  36301  36821   37621
2017-06-29  BAL       Thursday   37567  38097   38703
2017-06-30  BOS       Friday     39173  39430   39839
2017-07-01  BOS       Saturday   44145  45418   45766
2017-07-02  BOS       Sunday     44249  45475   45805
2017-07-06  HOU       Thursday   36939  37547   39022
2017-07-07  HOU       Friday     38056  38464   39862
2017-07-08  HOU       Saturday   43609  44757   45603
2017-07-09  HOU       Sunday     43515  44637   45399
2017-07-24  OAK       Monday     36762  37581   40468
2017-07-25  OAK       Tuesday    36963  37780   40518
2017-07-26  OAK       Wednesday  37351  38192   41069
2017-07-27  OAK       Thursday   40394  42074   43721
2017-07-28  LAA       Friday     40203  40733   43795
2017-07-29  LAA       Saturday   43759  45108   46590
2017-07-30  LAA       Sunday     44100  45591   46455
2017-08-08  NYY       Tuesday    38336  39201   41910
2017-08-09  NYY       Wednesday  38565  39444   41721
2017-08-10  NYY       Thursday   39810  40466   42688
2017-08-11  PIT       Friday     38578  39706   42773
2017-08-12  PIT       Saturday   43011  44589   45780
2017-08-13  PIT       Sunday     43092  44578   45838
2017-08-14  TBR       Monday     37211  38177   41065
2017-08-15  TBR       Tuesday    37299  38334   41159
2017-08-16  TBR       Wednesday  37846  38904   41776
2017-08-17  TBR       Thursday   40524  42221   43849
2017-08-25  MIN       Friday     39508  40409   43692
2017-08-26  MIN       Saturday   42784  44483   46213
2017-08-27  MIN       Sunday     42809  44459   46016
2017-08-28  BOS       Monday     38960  39836   42532
2017-08-29  BOS       Tuesday    38770  39527   41897
2017-08-30  BOS       Wednesday  39390  40197   42290
2017-09-08  DET       Friday     41590  42901   45381
2017-09-09  DET       Saturday   43861  45523   46843
2017-09-10  DET       Sunday     43795  45448   46827
2017-09-11  BAL       Monday     37710  39279   42609
2017-09-12  BAL       Tuesday    37680  39255   42387
2017-09-13  BAL       Wednesday  38220  39719   42836
2017-09-19  KCR       Tuesday    38343  40282   43381
2017-09-20  KCR       Wednesday  38505  40435   43495
2017-09-21  KCR       Thursday   39376  41129   44231
2017-09-22  NYY       Friday     42611  43853   46107
2017-09-23  NYY       Saturday   44048  45494   46785
2017-09-24  NYY       Sunday     43681  45291   46340