Stepwise Regression

Which variables should you drop from your model? This is a common question for all modeling, but especially logistic regression. In this section we will cover a popular variable selection technique - stepwise regression. This isn’t the only possible technique, but it will be the primary focus here.
We will be going back to the Ames, Iowa dataset for exploring these techniques.
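The data preparation used throughout this section comes from the setup code for this deck - it creates the binary Bonus target (sale price above $175,000) and a 70/30 training/test split:

Code

library(AmesHousing)
ames <- make_ordinal_ames()

library(tidyverse)

# Create the binary target: bonus eligibility for homes selling above $175,000
ames <- ames %>%
  mutate(Bonus = ifelse(Sale_Price > 175000, 1, 0))

# 70/30 training/test split
set.seed(123)
ames <- ames %>%
  mutate(id = row_number())
train <- ames %>%
  sample_frac(0.7)
test <- anti_join(ames, train, by = 'id')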
Stepwise regression techniques involve three common methods:
Forward Selection
Backward Selection
Stepwise Selection
These techniques add or remove (depending on the technique) one variable at a time from your regression model to try to “improve” the model. There are a variety of different selection criteria that can be used to add or remove variables from a logistic regression. Two common approaches are to use either p-values or one of AIC/BIC.
P-values are falling out of popularity primarily because people often use the 0.05 significance level without any regard to sample size. Although 0.05 is a good significance level for a sample size around 50, this level should be adjusted based on sample size.
However, it can be shown mathematically that the AIC/BIC criterion for adding or removing variables with stepwise selection (which is becoming very popular) is the same thing as using p-values in likelihood ratio tests. AIC is calculated as follows:
\[
AIC = -2 \log(L) + 2p
\] where \(L\) is the likelihood function and \(p\) is the number of variables being estimated in the model. Let’s compare two models - one with \(p\) variables and one with \(p+1\) variables. Assuming the additional variable lowers AIC, we can see the following relationship:

\[
\begin{aligned}
AIC_{p+1} &< AIC_p \\
-2 \log(L_{p+1}) + 2(p+1) &< -2 \log(L_p) + 2p \\
2 &< 2(\log(L_{p+1}) - \log(L_p))
\end{aligned}
\]
The right-hand side of this inequality is a likelihood ratio test (LRT) statistic that follows a \(\chi^2_1\) distribution. So we know the significance level implied by this LRT is the following:
\[
P(\chi^2_1 > 2) = 0.1573
\]
This means that the AIC selection method for stepwise selection is mathematically the same as a p-value stepwise selection technique with a significance level (\(\alpha\) level) of 0.1573 for continuous variables or binary categorical variables. A categorical variable with more than two categories would contribute more than 1 degree of freedom to the calculation above.
The same math can be applied to the BIC selection technique as well. The BIC calculation is the following:
\[
BIC = -2 \log(L) + p \times \log(n)
\]
Working through the math, this is also a Likelihood Ratio Test that follows a \(\chi^2_1\) distribution. So we know the significance level from this LRT is the following:
\[
P(\chi^2_1 > \log(n)) = \ldots
\]
Notice how the significance level changes depending on sample size due to the BIC equation, which is exactly what is recommended for any selection technique using p-values.
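To make this concrete, here is a minimal sketch in R of the significance levels implied by the AIC and BIC rules (the sample size used here assumes the Ames training set created above, roughly 2,000 rows):

Code

# Implied alpha for an AIC-based add/drop decision (1 degree of freedom)
1 - pchisq(2, df = 1)        # 0.1573

# Implied alpha for a BIC-based decision shrinks as the sample size n grows
n <- nrow(train)             # assumes the Ames training data created above
1 - pchisq(log(n), df = 1)   # roughly 0.006 for n near 2,000

This BIC-implied level is in line with the significance levels used in the SAS code later in this section.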
With that all understood, let’s look at the three common stepwise regression approaches.
Stepwise Selection
Here we will work through stepwise selection. In stepwise selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see if it is significant based on AIC/BIC or p-value with a specified significance level. The most significant variable (or the one that reduces the AIC/BIC the most) is then added to the model. Next, all variables in the model (here only one) are tested to see if they are still significant (or don’t hinder the AIC/BIC if dropped). If not, they are dropped. If so, then the remaining variables are again tested and the next most impactful variable (p-value or AIC/BIC depending on the approach) is added to the model. This process repeats until either no more significant variables are available to add or the same variable keeps being added and then removed from the model based on AIC/BIC or p-value (depending on the approach).
Let's look at this approach in all of our software!
R’s step function traditionally uses either AIC or BIC for adding or removing variables from the stepwise regression techniques.
In R we must specify two models - the empty and full models - to step between. Notice the full model contains all the variables, while the empty model contains only the intercept (signified by ~ 1 in the formula). We then use the step function where we start at the empty model. The scope option tells R the lower (lower =) and upper (upper =) model to step between. The direction = "both" option tells R to use stepwise selection.
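Code

# Full model with all candidate variables
full.model <- glm(Bonus ~ Gr_Liv_Area + factor(House_Style) + Garage_Area +
                    Fireplaces + factor(Full_Bath) + factor(Half_Bath) + Lot_Area +
                    factor(Central_Air) + Second_Flr_SF + TotRms_AbvGrd + First_Flr_SF,
                  data = train, family = binomial(link = "logit"))

# Empty model with only the intercept
empty.model <- glm(Bonus ~ 1, data = train, family = binomial(link = "logit"))

# Stepwise selection between the empty and full models
step.model <- step(empty.model,
                   scope = list(lower = formula(empty.model),
                                upper = formula(full.model)),
                   direction = "both")
summary(step.model)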
The model building summary shows which variables were added based on lowest AIC. This may be different across different stepwise techniques.
At the time of writing this code deck, Python does not have nice capabilities to do this automatically in statsmodels, scikit-learn, or scipy. All resources found involve either downloading and installing a package (mlxtend) that is not included by default in Anaconda or writing your own function. scikit-learn has something similar, but it uses the model’s coefficients to select, not p-values. This approach is completely unreasonable because coefficients reflect the units of the data. This means that smaller units in a variable lead to a smaller coefficient. The only way this would be appropriate is if all of the variables were standardized ahead of time. scikit-learn does have stepwise selection capabilities that evaluate a metric on cross-validation. However, cross-validation is not covered in this code deck. The corresponding Machine Learning code deck goes through this approach.
PROC LOGISTIC uses significance levels and p-values for adding or removing variables from the stepwise selection techniques. PROC LOGISTIC uses the selection = stepwise option to perform this technique. Notice also the slentry = and slstay = options where we specify the significance levels to enter and to stay in the model, respectively. Here we chose 0.005 based on our sample size.
Code
ods html select ModelBuildingSummary ParameterEstimates ORPlot;

proc logistic data=logistic.ames_train plots(only)=(oddsratio);
  class House_Style Full_Bath Half_Bath Central_Air / param=ref;
  model Bonus(event='1') = Gr_Liv_Area House_Style Garage_Area Fireplaces
                           Full_Bath Half_Bath Lot_Area Central_Air Second_Flr_SF
                           TotRms_AbvGrd First_Flr_SF
        / selection=stepwise slentry=0.005 slstay=0.005 clodds=pl clparm=pl;
  title 'Modeling Bonus Eligibility';
run;
quit;
The model building summary shows that some variables were added. This may be different across different stepwise techniques.
Backward Selection
Here we will work through backward selection. In backward selection, the initial model is full (contains all variables, including the intercept). Each variable is then tried to see if it is significant based on AIC/BIC or p-value with a specified significance level. The least significant variable (or the one that hinders the AIC/BIC the most) is then dropped from the model. Then the remaining variables are again tested and the next least significant variable is dropped from the model. This process repeats until no more insignificant variables are available to drop or the AIC/BIC no longer improves with the deletion of another variable.
Let's look at this approach in all of our software!
In R we must specify one model - the full model - to step backward from. Notice the full model contains all the variables. We then use the step function where we start at the full model. The direction = "backward" option tells R to use backward selection.
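Code

# Full model with all candidate variables
full.model <- glm(Bonus ~ Gr_Liv_Area + factor(House_Style) + Garage_Area +
                    Fireplaces + factor(Full_Bath) + factor(Half_Bath) + Lot_Area +
                    factor(Central_Air) + Second_Flr_SF + TotRms_AbvGrd + First_Flr_SF,
                  data = train, family = binomial(link = "logit"))

# Start from the full model and only consider dropping variables
back.model <- step(full.model, direction = "backward")
summary(back.model)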
The model building summary shows which variables were removed based on lowest AIC. This may be different across different stepwise techniques.
At the time of writing this code deck, Python does not have nice capabilities to do this automatically in statsmodels, scikit-learn, or scipy. All resources found involve either downloading and installing a package (mlxtend) that is not included by default in Anaconda or writing your own function. scikit-learn has something similar, but it uses the model’s coefficients to select, not p-values. This approach is completely unreasonable because coefficients reflect the units of the data. This means that smaller units in a variable lead to a smaller coefficient. The only way this would be appropriate is if all of the variables were standardized ahead of time. scikit-learn does have backward selection capabilities that evaluate a metric on cross-validation. However, cross-validation is not covered in this code deck. The corresponding Machine Learning code deck goes through this approach.
PROC LOGISTIC uses the selection = backward option to perform this technique. Notice also the slstay = option where we specify the significance level to stay in the model. Here we chose 0.005 based on our sample size.
Code
ods html select ModelBuildingSummary ParameterEstimates ORPlot;

proc logistic data=logistic.ames_train plots(only)=(oddsratio);
  class House_Style Full_Bath Half_Bath Central_Air / param=ref;
  model Bonus(event='1') = Gr_Liv_Area House_Style Garage_Area Fireplaces
                           Full_Bath Half_Bath Lot_Area Central_Air Second_Flr_SF
                           TotRms_AbvGrd First_Flr_SF
        / selection=backward slstay=0.005 clodds=pl clparm=pl;
  title 'Modeling Bonus Eligibility';
run;
quit;
The model building summary shows that some variables were dropped. This may be different across different stepwise techniques.
Forward with Interactions
Here we will work through forward selection. In forward selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see if it is significant based on AIC/BIC or p-value with a specified significance level. The most significant variable (or the one that reduces the AIC/BIC the most) is then added to the model. Then the remaining variables are again tested and the next most impactful variable (p-value or AIC/BIC depending on the approach) is added to the model. This process repeats until no more significant variables are available to add to the model based on AIC/BIC or p-value (depending on the approach). This approach is the same as stepwise selection without the additional check at each step for possible removal.
Forward selection is the least used technique because stepwise selection does the same as forward selection with the added benefit of dropping insignificant variables. The main use for forward selection is to test higher order terms and interactions in models.
Let's look at this approach in all of our software!
In R we must specify two models to step between. We then use the step function starting from the smaller of the two models. The scope option tells R the lower (lower =) and upper (upper =) model to step between. The direction = "forward" option tells R to use forward selection.
Here we are testing some two-way interactions. This can be done in R by creating two models - the main effects model and the two-way interaction model. Here we set the starting point for forward selection at the main effects model and step up to the interaction model.
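Code

# Main effects only model - the starting point for forward selection
main.model <- glm(Bonus ~ Gr_Liv_Area + factor(House_Style) + Garage_Area +
                    factor(Fireplaces) + factor(Full_Bath) + factor(Half_Bath) +
                    Lot_Area + factor(Central_Air) + Second_Flr_SF + TotRms_AbvGrd +
                    First_Flr_SF,
                  data = train, family = binomial(link = "logit"))

# Model with candidate two-way interactions - the upper bound of the search
int.model <- glm(Bonus ~ Gr_Liv_Area + factor(House_Style) + Garage_Area + Fireplaces +
                   factor(Full_Bath) + factor(Half_Bath) + Lot_Area + factor(Central_Air) +
                   Second_Flr_SF + TotRms_AbvGrd + First_Flr_SF +
                   Gr_Liv_Area*factor(House_Style) + TotRms_AbvGrd*factor(House_Style) +
                   Gr_Liv_Area*factor(Fireplaces),
                 data = train, family = binomial(link = "logit"))

# Forward selection from the main effects model up to the interaction model
for.model <- step(main.model,
                  scope = list(lower = formula(main.model),
                               upper = formula(int.model)),
                  direction = "forward")
summary(for.model)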
Notice that the modeling summary contains the addition of a couple of these interactions.
At the time of writing this code deck, Python does not have nice capabilities to do this automatically in statsmodels, scikit-learn, or scipy. All resources found involve either downloading and installing a package (mlxtend) that is not included by default in Anaconda or writing your own function. scikit-learn has something similar, but it uses the model’s coefficients to select, not p-values. This approach is completely unreasonable because coefficients reflect the units of the data. This means that smaller units in a variable lead to a smaller coefficient. The only way this would be appropriate is if all of the variables were standardized ahead of time. scikit-learn does have forward selection capabilities that evaluate a metric on cross-validation. However, cross-validation is not covered in this code deck. The corresponding Machine Learning code deck goes through this approach.
PROC LOGISTIC uses the selection = forward option to perform this technique. Notice also the slentry = option where we specify the significance level to enter the model. Here we chose 0.05.
This approach can be done in PROC LOGISTIC by placing a “|” in between variables. The “@2” is used to specify that we want only two-way interactions between these variables. The include = option forces SAS to include the first eleven variables - our main effects. This forces SAS to check all two-way interactions one at a time without having all of them in the model from the beginning like backward selection with interactions would do.
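Code

ods html select ModelBuildingSummary ParameterEstimates ORPlot;

proc logistic data=logistic.ames_train plots(only)=(oddsratio);
  class House_Style Full_Bath Half_Bath Central_Air / param=ref;
  model Bonus(event='1') = Gr_Liv_Area House_Style Garage_Area Fireplaces
                           Full_Bath Half_Bath Lot_Area Central_Air Second_Flr_SF
                           TotRms_AbvGrd First_Flr_SF
                           Gr_Liv_Area|House_Style|Fireplaces @2
        / selection=forward slentry=0.05 clodds=pl clparm=pl include=11;
  title 'Modeling Bonus Eligibility';
run;
quit;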
Here we can see that none of the interactions were added to the model.
Diagnostic Plots
Linear regression models contain residuals with properties that are very useful for model diagnostics. However, what is a residual in a logistic regression model? Since we don’t have actual probabilities to compare our predicted probabilities against, residuals are not as clearly defined. Instead we have pseudo “residuals” in logistic regression that we can explore further. Two examples of this are deviance residuals and Pearson residuals.
Deviance is a measure of how far a fitted model is from the fully saturated model. The fully saturated model is a model that predicts our data perfectly by essentially overfitting to it - a variable for each unique combination of inputs. This makes this model impractical for use, but good for comparison. The deviance is essentially our “error” from this “perfect” model. Logistic regression minimizes the deviance - equivalently, the sum of the squared deviance residuals. Deviance residuals tell us how much each observation contributes to the deviance.
Pearson residuals on the other hand tell us how much each observation changes the Pearson Chi-squared test for the overall model.
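As a minimal sketch of how these pseudo-residuals behave (assuming a fitted glm object such as the logit.model built in the R tab below), the squared deviance residuals sum to the model's residual deviance, while the squared Pearson residuals sum to the Pearson chi-square statistic:

Code

# Deviance and Pearson residuals for each observation
head(residuals(logit.model, type = "deviance"))
head(residuals(logit.model, type = "pearson"))

# Squared deviance residuals sum to the residual deviance
sum(residuals(logit.model, type = "deviance")^2)
deviance(logit.model)

# Squared Pearson residuals sum to the Pearson chi-square statistic
sum(residuals(logit.model, type = "pearson")^2)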
Other forms of measuring an observation’s influence on the logistic regression model are DFBetas and Cook’s D. Similar to their interpretation in linear regression, these two calculations tell us how each observation changes the estimation of each parameter individually (DFBeta) or how each observation changes the estimation of all the parameters holistically (Cook’s D).
Let's see how to get all of these from our software!
R has some wonderful diagnostic plots to show us these residuals. R also produces a list of these measures of influence as well as many more with the influence.measures function. Below only the first 6 observations are shown using the head function, but this is calculated for each of the observations. The 4th plot in the plot function on the logistic regression model object is the Cook's D plot as shown below. The dfbetasPlots function produces the DFBetas plots, but only one is shown here.
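Code

logit.model <- glm(Bonus ~ Gr_Liv_Area + factor(House_Style) + Garage_Area +
                     Fireplaces + factor(Full_Bath) + Lot_Area + factor(Central_Air) +
                     TotRms_AbvGrd + Gr_Liv_Area:Fireplaces,
                   data = train, family = binomial(link = "logit"))
summary(logit.model)

library(car)

# Influence measures for each observation (first 6 shown)
head(influence.measures(logit.model)$infmat)

# Cook's D plot (the 4th default diagnostic plot for a glm object)
plot(logit.model, 4)

# DFBetas plot for one variable, coloring points by the observed target
dfbetasPlots(logit.model, terms = "Gr_Liv_Area", id.n = 5,
             col = ifelse(logit.model$y == 1, "red", "blue"))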
The DFBetas plots show the standardized impact of each observation on the calculation of each of the parameters in the model. The main thing to look for in these plots are points that are far away from the rest of the observations.
These observations are not necessarily bad per se, but have a large influence on the model. These points might need to be investigated further to see if they are actually valid observations.
Python has some wonderful diagnostic plots to show us these residuals. Python also produces a list of these measures of influence as well as many more with the get_influence function. Below only the first 5 observations are shown using the head function, but this is calculated for each of the observations.
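Code

from statsmodels.genmod.families import Binomial
from statsmodels.genmod.generalized_linear_model import GLM

log_model = GLM.from_formula('Bonus ~ Gr_Liv_Area + Garage_Area + Fireplaces + C(Full_Bath) + Lot_Area + C(Central_Air) + TotRms_AbvGrd + Gr_Liv_Area:Fireplaces',
                             data = train, family = Binomial()).fit()
log_model.summary()

# Influence measures for each observation (first 5 shown)
log_diag = log_model.get_influence()
log_diag.summary_frame().head()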
Here the plot_influence function shows the points graphed by their studentized residuals as well as their influence measured by H leverage - yet another way to measure impact on a regression model. At the time of writing this code deck, Python did not have an easy functionality for DFBetas plots.
Code
from matplotlib import pyplot as plt

plt.cla()
log_diag.plot_influence()
plt.show()
SAS has some wonderful diagnostic plots to show us these residuals. In PROC LOGISTIC we ask for the influence and dfbetas plots from the plots = option.
Code
ods html select ParameterEstimates InfluencePlots DfBetasPlot;

proc logistic data=logistic.ames_train plots(only label)=(influence dfbetas);
  class House_Style Full_Bath Half_Bath Central_Air / param=ref;
  model Bonus(event='1') = Gr_Liv_Area House_Style Garage_Area Fireplaces
                           Full_Bath Lot_Area Central_Air TotRms_AbvGrd
                           Gr_Liv_Area|Fireplaces @2
        / clodds=pl clparm=pl;
  title 'Modeling Bonus Eligibility';
run;
quit;
The influence plots show a variety of plots. Among them are the Pearson residuals, Deviance residuals, and the difference in each of these for each observation. The main thing to look for in these plots are points that are far away from the rest of the observations.
The DFBetas plots show the standardized impact of each observation on the calculation of each of the parameters in the model. The main thing to look for in these plots are points that are far away from the rest of the observations.
These observations are not necessarily bad per se, but have a large influence on the model. These points might need to be investigated further to see if they are actually valid observations.
Calibration Curves
Another evaluation/diagnostic for logistic regression is the calibration curve. The calibration curve is a goodness-of-fit measure for logistic regression. Calibration measures how well predicted probabilities agree with actual frequency counts of outcomes (estimated linearly across the data set). These curves can help detect if predictions are consistently too high or low in your model.
If the curve is above the diagonal line, this indicates the model is predicting lower probabilities than actually observed. The opposite is true if the curve is below the diagonal line.
This is best used on larger samples since we are calculating the observed proportion of events in the data. In smaller samples this relationship is extrapolated out from the center and may not as accurately reflect the truth.
Let's look at creating these in all of our software!
R produces a calibration curve using the givitiCalibrationBelt function. The inputs to this function are o = and e =. These are the observed target and expected target respectively. We place our actual target variable in the o = option and the predictions from our logistic regression model in the e = option. Since the model is being compared to training data the devel = internal option is specified. The maxDeg = option sets the maximum degree being tested for the curve.
Code

library(givitiR)

cali.curve <- givitiCalibrationBelt(o = train$Bonus,
                                    e = predict(logit.model, type = "response"),
                                    devel = "internal",
                                    maxDeg = 5)

plot(cali.curve, main = "Bonus Eligibility Model Calibration Curve",
     xlab = "Predicted Probability",
     ylab = "Observed Bonus Eligibility")
$m
[1] 3
$p.value
[1] 0
Since the diagonal line is contained in the confidence interval for our calibration curve, we do not notice any significant number of over or under predictions.
Python does not produce a calibration curve by default, but we can easily create it ourselves. After building the model with the GLM.from_formula function and outputting the predicted probabilities from our training data using the predict function, we then sort these predicted probabilities using the sort_values function.
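Code

log_model = GLM.from_formula('Bonus ~ Gr_Liv_Area + C(House_Style) + Garage_Area + Fireplaces + C(Full_Bath) + Lot_Area + C(Central_Air) + TotRms_AbvGrd + Gr_Liv_Area:Fireplaces',
                             data = train, family = Binomial()).fit()
log_model.summary()

# Store the predicted probabilities on the training data
train['Pred'] = log_model.predict(train)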
We then use the LOESS algorithm (through the statsmodels.api.nonparametric.lowess function) to fit a LOESS regression to our target variable (Bonus) against our sorted predicted probabilities (Pred). We then plot this LOESS curve, which is our calibration curve.
Code

import numpy as np
import statsmodels.api as sm

train_sort = train.sort_values(by=['Pred'])
smoothed = sm.nonparametric.lowess(exog=train_sort['Pred'], endog=train_sort['Bonus'], frac=0.85)

plt.cla()
plt.plot(smoothed[:, 0], smoothed[:, 1])
plt.show()
Since the calibration curve estimation is relatively linear, we do not notice any significant number of over or under predictions.
SAS does not produce a calibration curve by default, but we can easily create it ourselves. After building the model with PROC LOGISTIC and outputting the predicted probabilities from our training data using the OUTPUT statement, we then sort these predicted probabilities using PROC SORT. We do this to easily plot the predicted probabilities in the SGPLOT procedure.
In PROC SGPLOT we use our predicted probabilities data set - here called cali. We then use the LOESS statement to fit a LOESS regression to our target variable (Bonus) using our predicted probabilities (PredProb). We allow for a cubic interpolation on this LOESS regression. The cml option produces a confidence interval around this curve. We then plot the diagonal line through the origin with a slope of 1. This produces our calibration curve.
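Code

proc logistic data=logistic.ames_train noprint;
  class House_Style Full_Bath Half_Bath Central_Air / param=ref;
  model Bonus(event='1') = Gr_Liv_Area House_Style Garage_Area Fireplaces
                           Full_Bath Lot_Area Central_Air TotRms_AbvGrd
                           Gr_Liv_Area|Fireplaces @2 / clodds=pl clparm=pl;
  output out=cali predicted=PredProb;
run;

proc sort data=cali;
  by PredProb;
run;

proc sgplot data=cali noautolegend aspect=1;
  loess x=PredProb y=Bonus / interpolation=cubic clm;
  lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
  title 'Calibration Curve for Bonus Eligibility Data';
run;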
Since the diagonal line is contained in the confidence interval for our calibration curve, we do not notice any significant number of over or under predictions.