FANTASY BASEBALL (A)
It was February, and John Hanke, a retired professor of statistics, was preparing for another fantasy baseball season. In past years, his fellow players had always teased him about using his knowledge of statistics to gain an advantage. Unfortunately, it had never been true. Teaching, researching, publishing, and committee work had kept him far too busy. Now, having recently retired, he finally had the time to apply his knowledge of statistics to the annual rotisserie draft. In this type of fantasy league, each manager has $260 with which to bid on and purchase 23 players (14 hitters and 9 pitchers). Each team is then ranked (based on actual player statistics from the previous season) in eight statistical categories. Dr. Hanke was very concerned with choosing players who would perform well on three out of the four pitching categories. In past years, his pitching staff, especially his starting pitchers, had been the laughing stock of the league. The 2007 season was going to be different. He intended to develop models to accurately forecast pitching performances for starting pitchers.
The three categories that Hanke wished to research were wins (WINS), earned run average (ERA), and walks and hits given up per inning pitched (WHIP). He had spent a considerable amount of time downloading baseball statistics for starting pitchers from the 2006 season. He intended to develop a multiple regression model to forecast each of the three categories of interest. He had often preached to his students that the initial variable selection was the most important aspect of developing a regression model. He knew that, if he didn't have good predictor variables, he wouldn't end up with useful prediction equations. After a considerable amount of work, Dr. Hanke chose the five potential predictor variables that follow. He also decided to include only starting pitchers who had pitched at least 100 innings during the season. A portion of the data for the 138 starting pitchers selected is presented in Table 7-23.
The variables are defined as
ERA: Earned run average or the number of earned runs allowed per game (nine innings pitched)
WHIP: Number of walks plus hits given up per inning pitched
CMD: Command of pitches, the ratio strikeouts/walks
K/9: How many batters a pitcher strikes out per game (nine innings pitched)
HR/9: Opposition home runs per game (nine innings pitched)
OBA: Opposition batting average
THROWS: Right-handed pitcher (1) or left-handed pitcher (0)
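The rate statistics defined above are simple ratios of season totals, and a short sketch can make the arithmetic concrete. The function name `pitching_rates`, its parameter names, and the season line used below are hypothetical and are not taken from Table 7-23. The sketch also assumes innings are given as a decimal number rather than in the box-score convention where ".1" means one third of an inning.

```python
def pitching_rates(innings, earned_runs, walks, hits, strikeouts, home_runs):
    """Return the rate statistics defined above from raw season totals.

    Assumes `innings` is a decimal number of innings pitched.
    """
    if innings <= 0 or walks <= 0:
        raise ValueError("innings and walks must be positive")
    return {
        "ERA": 9 * earned_runs / innings,    # earned runs per nine innings
        "WHIP": (walks + hits) / innings,    # walks plus hits per inning
        "CMD": strikeouts / walks,           # strikeouts per walk
        "K/9": 9 * strikeouts / innings,     # strikeouts per nine innings
        "HR/9": 9 * home_runs / innings,     # home runs per nine innings
    }


# Hypothetical season line for illustration only.
rates = pitching_rates(
    innings=200, earned_runs=80, walks=50, hits=190,
    strikeouts=150, home_runs=20,
)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

OBA and THROWS are left out of the sketch: OBA needs opposition at-bats, which the case does not describe, and THROWS is an indicator rather than a computed rate.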
The next step in the analysis was the creation of the correlation matrix shown in Table 7-24.
Dr. Hanke found the correlations between ERA and WHIP and between ERA and OBA to be the same, .825. Moreover, the correlation between WHIP and OBA, .998, indicated that these variables are strongly linearly related. Consequently, pitchers who performed well on one of these variables should perform well on the other two. Dr. Hanke decided that ERA was the best indicator of performance and decided to run a regression to see how well the variables in his collection predicted ERA. He knew, however, that the very high correlation between WHIP and OBA would create a multicollinearity problem, so only one of these variables would be required in the regression function. Tossing a coin, Dr. Hanke selected OBA and ran a regression with ERA as the dependent variable and the remaining variables, with the exception of WHIP, as independent variables. The results are shown in Table 7-25.
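The screening step described here, scanning a correlation matrix for predictor pairs that are almost perfectly correlated, might be sketched as follows. This is an illustration under stated assumptions, not Dr. Hanke's Minitab procedure. The data are synthetic (generated so that the WHIP column is nearly a linear function of the OBA column), the helper `collinear_pairs` is hypothetical, and the 0.95 cutoff is an arbitrary choice for the example.

```python
import numpy as np


def collinear_pairs(data, names, threshold=0.95):
    """Return (name_i, name_j, r) for column pairs with |r| above threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic stand-in for the 138 pitchers; NOT the data in Table 7-23.
rng = np.random.default_rng(0)
n = 138
oba = rng.normal(0.270, 0.020, n)
whip = 5.0 * oba + rng.normal(0.0, 0.005, n)  # nearly collinear with oba
hr9 = rng.normal(1.1, 0.3, n)

X = np.column_stack([oba, whip, hr9])
flagged = collinear_pairs(X, ["OBA", "WHIP", "HR/9"])
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.3f}")
```

When a pair is flagged, only one of the two variables would normally be carried into the regression, which is the choice the coin toss settles in the case.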
From Table 7-25, the independent variables THROWS and K/9 are not significant, given the other variables in the regression function. Moreover, the small VIFs suggest that THROWS and K/9 can be dropped together and the coefficients for the remaining variables will not change much. Table 7-26 shows the result when both THROWS and K/9 are left out of the model. The R² is 78.1%, and the equation looks good. The t statistic for each of the predictor variables is large with a very small p-value. The VIFs are small for the three predictors, indicating that multicollinearity is no longer a problem.
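For readers who want to see how a variance inflation factor is computed, the sketch below uses the standard definition: the VIF for predictor j is 1 / (1 - R²), where R² comes from regressing predictor j on the other predictors. Everything in it is an assumption made for illustration. The data are synthetic, the coefficients used to generate ERA are invented, and the helpers `r_squared` and `vifs` are hypothetical names. It does not reproduce Tables 7-25 or 7-26.

```python
import numpy as np


def r_squared(y, X):
    """R-squared from an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def vifs(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the others."""
    return [
        1.0 / (1.0 - r_squared(X[:, j], np.delete(X, j, axis=1)))
        for j in range(X.shape[1])
    ]


# Synthetic predictors and response; coefficients are invented.
rng = np.random.default_rng(1)
n = 138
cmd = rng.normal(2.5, 0.8, n)
hr9 = rng.normal(1.1, 0.3, n)
oba = rng.normal(0.270, 0.020, n)
era = -5.0 - 0.3 * cmd + 1.2 * hr9 + 32.0 * oba + rng.normal(0.0, 0.4, n)

# Three-predictor model, analogous in form to the final model in the case.
X3 = np.column_stack([cmd, hr9, oba])
fit_r2 = r_squared(era, X3)
vif3 = vifs(X3)

# Adding a near-duplicate of OBA shows how the VIFs react to collinearity.
whip = 5.0 * oba + rng.normal(0.0, 0.005, n)
X4 = np.column_stack([cmd, hr9, oba, whip])
vif4 = vifs(X4)

print("R-squared:", round(float(fit_r2), 3))
print("VIFs, three predictors:", [round(float(v), 2) for v in vif3])
print("VIFs, with near-duplicate:", [round(float(v), 1) for v in vif4])
```

In the synthetic setup the three independent predictors have VIFs close to 1, while the two nearly collinear columns have very large VIFs. That is the pattern that motivates keeping only one of WHIP and OBA.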
Dr. Hanke decided that he had a good model and developed the residual plots shown in Figure 7-4.
TABLE 7-23 Pitching Statistics for 138 Starting Pitchers
*The full data set is available on the website www.prenhall.com/hanke under Chapter 7, Case 7-3.
TABLE 7-24 Correlations: ERA, THROWS, WHIP, K/9, CMD, HR/9, OBA
TABLE 7-25 Minitab Regression Output Using All Predictor Variables Except WHIP
TABLE 7-26 Minitab Final Regression Output for Forecasting ERA
FIGURE 7-4 Residual Plots for Forecasting ERA
Develop a model to forecast ERA using the predictor variable WHIP instead of OBA. Which model do you prefer, the one with OBA as a predictor variable or the one with WHIP as a predictor variable? Why?