Generalized Additive Models (GAM’s) provide a general framework for adding non-linear functions together instead of relying on the single linear structure of linear regression. GAM’s can be used for either continuous or categorical target variables. The structure of a GAM is the following:

\[y = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) + \varepsilon\]
The \(f_i(x_i)\) functions are flexible, nonlinear functions of the individual predictor variables. GAM’s add these separate functions together, which allows many complex relationships to be modeled and can potentially improve predictions of your target variable. We will examine a few different forms of GAM’s below.
Piecewise Linear Regression
The slope of the linear relationship between a predictor variable and a target variable can change over different values of the predictor variable. The typical straight-line model \(\hat{y} = \beta_0 + \beta_1x_1\) will not be a good fit for this type of data.
Here is an example of a data set that exhibits this behavior - the compressive strength of concrete and the proportion of water mixed with the cement. The compressive strength decreases at a much faster rate for batches with a water/cement ratio greater than 70%.
If you were to fit a linear regression as in the first image above, it wouldn’t represent the data very well. However, this situation is perfect for a piecewise linear regression - a model with different straight-line relationships over different intervals of the predictor variable. The following piecewise linear regression has two slopes:

\[y = \beta_0 + \beta_1x_1 + \beta_2(x_1-k)x_2 + \varepsilon\]
The \(k\) value in the equation above is called the knot value for \(x_1\). The \(x_2\) variable is defined as a value of 1 when \(x_1 > k\) and a value of 0 when \(x_1 \le k\). With \(x_2\) defined this way, when \(x_1 \le k\), the equation becomes \(y = \beta_0 + \beta_1x_1 + \varepsilon\). When \(x_1 > k\), the equation gets a new intercept and slope: \(y = (\beta_0 - k\beta_2) + (\beta_1 + \beta_2)x_1 + \varepsilon\).
Piecewise linear regression is built with typical linear regression functions with only some creation of new variables. In R, we can use the lm function on this cement data set. We are predicting STRENGTH using the RATIO variable as well as the X2STAR variable which is \((x_1 - k)x_2\).
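The cement data set already contains the constructed X2 and X2STAR columns. Purely as an illustration (and assuming the knot is at a ratio of 70, as discussed with the plot below), here is one way those columns could be built by hand in R:
Code
# Sketch: building the piecewise terms by hand, assuming the knot is k = 70
k <- 70
x2_check     <- ifelse(cement$RATIO > k, 1, 0)     # 1 above the knot, 0 at or below it
x2star_check <- (cement$RATIO - k) * x2_check      # the (x1 - k) * x2 term
head(cbind(x2_check, x2star_check, cement$X2STAR)) # should match the X2STAR column used below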
Code
cement.lm <- lm(STRENGTH ~ RATIO + X2STAR, data = cement)
summary(cement.lm)
Call:
lm(formula = STRENGTH ~ RATIO + X2STAR, data = cement)
Residuals:
Min 1Q Median 3Q Max
-0.72124 -0.09753 -0.00163 0.24297 0.49393
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.79198 0.67696 11.510 7.62e-09 ***
RATIO -0.06633 0.01123 -5.904 2.89e-05 ***
X2STAR -0.10119 0.02812 -3.598 0.00264 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3286 on 15 degrees of freedom
Multiple R-squared: 0.9385, Adjusted R-squared: 0.9303
F-statistic: 114.4 on 2 and 15 DF, p-value: 8.257e-10
Code
ggplot(cement, aes(x = RATIO, y = STRENGTH)) +
  geom_point() +
  geom_line(data = cement, aes(x = RATIO, y = cement.lm$fitted.values)) +
  ylim(0, 6)
We can see in the plot above how, at the knot value of 70, the slope and intercept of the regression line change. Using the estimates above, the slope is about \(-0.066\) below the knot, and above the knot it becomes \(\beta_1 + \beta_2 \approx -0.066 + (-0.101) = -0.168\).
The previous example dealt with piecewise functions that are continuous - the lines stay attached. However, you could make a small adjustment to the model to make the lines discontinuous:

\[y = \beta_0 + \beta_1x_1 + \beta_2(x_1 - k)x_2 + \beta_3x_2 + \varepsilon\]
With the addition of the same \(x_2\) variable as previously defined, now on its own rather than only attached to the \((x_1-k)\) piece, the lines are no longer forced to stay attached.
Code
cement.lm <- lm(STRENGTH ~ RATIO + X2STAR + X2, data = cement)
summary(cement.lm)
Call:
lm(formula = STRENGTH ~ RATIO + X2STAR + X2, data = cement)
Residuals:
Min 1Q Median 3Q Max
-0.53167 -0.15513 0.06171 0.17239 0.49451
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.04975 0.68558 10.283 6.6e-08 ***
RATIO -0.05240 0.01174 -4.463 0.000536 ***
X2STAR -0.07888 0.02686 -2.937 0.010830 *
X2 -0.60388 0.26877 -2.247 0.041302 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2916 on 14 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9451
F-statistic: 98.57 on 3 and 14 DF, p-value: 1.188e-09
Code
ggplot(cement, aes(x = RATIO, y = STRENGTH, group = X2)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  ylim(0, 6)
Piecewise linear regression is built with typical linear regression functions with only some creation of new variables. In Python, we can use the ols function from statsmodels.formula.api on this cement data set. We are predicting STRENGTH using the RATIO variable as well as the X2STAR variable which is \((x_1-k)x_2\).
Code
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
import seaborn as sns

cement_lm = smf.ols("STRENGTH ~ RATIO + X2STAR", data = cement).fit()
cement_lm.summary()

plt.cla()
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(x = 'RATIO', y = 'STRENGTH', data = cement, ax = ax)
x_pred = cement['RATIO']
y_pred = cement_lm.fittedvalues
sns.lineplot(x = x_pred, y = y_pred, ax = ax)
plt.show()
                            OLS Regression Results
==============================================================================
Dep. Variable:               STRENGTH   R-squared:                       0.938
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     114.4
Date:                Fri, 25 Oct 2024   Prob (F-statistic):           8.26e-10
Time:                        13:50:02   Log-Likelihood:                -3.8688
No. Observations:                  18   AIC:                             13.74
Df Residuals:                      15   BIC:                             16.41
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.7920      0.677     11.510      0.000       6.349       9.235
RATIO         -0.0663      0.011     -5.904      0.000      -0.090      -0.042
X2STAR        -0.1012      0.028     -3.598      0.003      -0.161      -0.041
==============================================================================
Omnibus:                        1.877   Durbin-Watson:                   2.303
Prob(Omnibus):                  0.391   Jarque-Bera (JB):                1.074
Skew:                          -0.597   Prob(JB):                        0.585
Kurtosis:                       2.930   Cond. No.                         582.
==============================================================================
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see in the plot above how, at the knot value of 70, the slope and intercept of the regression line change.
The previous example dealt with piecewise functions that are continuous - the lines stay attached. However, you could make a small adjustment to the model to make the lines discontinuous:

\[y = \beta_0 + \beta_1x_1 + \beta_2(x_1 - k)x_2 + \beta_3x_2 + \varepsilon\]
With the addition of the same \(x_2\) variable as previously defined, now on its own rather than only attached to the \((x_1-k)\) piece, the lines are no longer forced to stay attached.
Code
cement_lm = smf.ols("STRENGTH ~ RATIO + X2STAR + X2", data = cement).fit()
cement_lm.summary()

plt.cla()
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(x = 'RATIO', y = 'STRENGTH', data = cement, ax = ax)
x_pred = cement['RATIO']
y_pred = cement_lm.fittedvalues
sns.lineplot(x = x_pred, y = y_pred, ax = ax)
plt.show()
                            OLS Regression Results
==============================================================================
Dep. Variable:               STRENGTH   R-squared:                       0.955
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     98.57
Date:                Fri, 25 Oct 2024   Prob (F-statistic):           1.19e-09
Time:                        13:50:03   Log-Likelihood:                -1.0976
No. Observations:                  18   AIC:                             10.20
Df Residuals:                      14   BIC:                             13.76
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0498      0.686     10.283      0.000       5.579       8.520
RATIO         -0.0524      0.012     -4.463      0.001      -0.078      -0.027
X2STAR        -0.0789      0.027     -2.937      0.011      -0.136      -0.021
X2            -0.6039      0.269     -2.247      0.041      -1.180      -0.027
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   2.336
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.446
Skew:                          -0.340   Prob(JB):                        0.800
Kurtosis:                       2.634   Cond. No.                         678.
==============================================================================
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Although the plot above makes the two line pieces look attached, that is just how the line is drawn in the visualization - the connecting segment between the two lines is not part of the model’s predictions.
The piecewise linear regression equation can be extended to have as many pieces as you want. An example with three lines (two knots) is as follows:

\[y = \beta_0 + \beta_1x_1 + \beta_2(x_1-k_1)x_2 + \beta_3(x_1-k_2)x_3 + \varepsilon\]
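As a sketch of what fitting this could look like in R, here is a hypothetical three-piece model on the cement data - the knot values (60 and 80) and the column names D2 and D3 are made up for illustration and are not part of the example above:
Code
# Hypothetical sketch: piecewise linear regression with two knots (k1 and k2 are made-up values)
k1 <- 60
k2 <- 80
cement$D2 <- ifelse(cement$RATIO > k1, 1, 0)   # plays the role of x2
cement$D3 <- ifelse(cement$RATIO > k2, 1, 0)   # plays the role of x3
three.lm <- lm(STRENGTH ~ RATIO + I((RATIO - k1) * D2) + I((RATIO - k2) * D3), data = cement)
summary(three.lm)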
One of the problems with this structure is that we have to define the knot values ourselves. The next set of models can help do that for us!
MARS (and EARTH)
Multivariate adaptive regression splines (MARS) is a non-parametric technique that keeps a linear, additive form for the model but allows nonlinearities and interactions between variables. Essentially, MARS uses a piecewise regression approach to split each predictor’s range into pieces and then fits either linear or nonlinear patterns within each piece.
MARS first looks for the point in the range of a predictor \(x_i\) where two linear functions, one on either side of the point, provide the smallest squared error (fit by linear regression).
The algorithm continues on each piece of the piecewise function until many knots are found.
This will eventually overfit your data. However, the algorithm then works backwards to “prune” (or remove) the knots that do not contribute significantly to out-of-sample accuracy. This out-of-sample accuracy calculation is performed using the generalized cross-validation (GCV) procedure - a computational short-cut for leave-one-out cross-validation. The algorithm does this for all of the variables in the data set and combines the outcomes together.
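One common way of writing the GCV criterion for a model with \(M\) terms is shown below; the exact penalty differs across implementations, so treat this as a representative form rather than the precise formula used by any particular package:

\[GCV(M) = \frac{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}_M(x_i)\big)^2}{\big(1 - \frac{C(M)}{n}\big)^2}\]

Here \(\hat{f}_M\) is the fitted model with \(M\) terms and \(C(M)\) is its effective number of parameters, which counts both the coefficients and a charge for each knot that was selected. Models with many knots receive a larger \(C(M)\), mimicking how leave-one-out cross-validation would penalize overfitting.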
The actual MARS algorithm is trademarked by Salford Systems, so the common implementation in most software is enhanced adaptive regression through hinges - called EARTH.
Let’s see how to do this in both R and Python!
Let’s go back to our Ames housing data set and the variables we were working with in the previous section. One of the variables in our data set is Garage_Area. It doesn’t have a straightforward relationship with our target variable Sale_Price, as seen in the plot below.
Code
ggplot(training, aes(x = Garage_Area, y = Sale_Price)) +
  geom_point()
Let’s fit the EARTH algorithm between Garage_Area and Sale_Price using the earth function in the earth package. The inputs are similar to most modeling functions in R: a formula to relate predictor variables to a target variable and an option to define the data set being used. We will then look at a summary of the output.
Code
library(earth)
mars1 <- earth(Sale_Price ~ Garage_Area, data = training)
summary(mars1)
Call: earth(formula=Sale_Price~Garage_Area, data=training)
coefficients
(Intercept) 124159.039
h(286-Garage_Area) -60.257
h(Garage_Area-286) 297.277
h(Garage_Area-521) -483.642
h(Garage_Area-576) 733.859
h(Garage_Area-758) -356.460
h(Garage_Area-1043) -490.873
Selected 7 of 7 terms, and 1 of 1 predictors
Termination condition: RSq changed by less than 0.001 at 7 terms
Importance: Garage_Area
Number of terms at each degree of interaction: 1 6 (additive model)
GCV 3427475346 RSS 6.94092e+12 GRSq 0.4492014 RSq 0.4556309
From the output above we see six pieces of the function defined by five knots. Those five knots (286, 521, 576, 758, and 1043) are the Garage_Area values that appear inside the h() terms above. The coefficients attached to each of those pieces play the same role as the coefficients in piecewise linear regression. The bottom of the output also shows the generalized \(R^2\) value (GRSq) as well as the typical \(R^2\) value.
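The h() terms in the output are hinge functions, \(h(u) = \max(0, u)\). As a sketch (using the rounded coefficients printed above, so the match is only approximate), the fitted EARTH curve could be reconstructed by hand like this:
Code
# Hinge function: max(0, u)
h <- function(u) pmax(0, u)

# Rebuild the fitted EARTH curve from the printed (rounded) coefficients
garage <- training$Garage_Area
fit_by_hand <- 124159.039 -
  60.257  * h(286 - garage) +
  297.277 * h(garage - 286) -
  483.642 * h(garage - 521) +
  733.859 * h(garage - 576) -
  356.460 * h(garage - 758) -
  490.873 * h(garage - 1043)

# These should closely match mars1$fitted.values
head(cbind(fit_by_hand, mars1$fitted.values))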
To visualize the piecewise relationship between Garage_Area and Sale_Price we can plot the predicted values on the scatterplot from above.
Code
ggplot(training, aes(x = Garage_Area, y = Sale_Price)) +
  geom_point() +
  geom_line(data = training, aes(x = Garage_Area, y = mars1$fitted.values), color = "blue")
We can see that the Sale_Price of the home stays relatively steady for small values of Garage_Area, but then increases to a point before it begins to level off again.
Now let’s build the algorithm on all the variables in the data set that we have. The Sale_Price ~ . notation tells the earth function to use all the variables in the data set to predict the Sale_Price.
Code
mars2 <- earth(Sale_Price ~ ., data = training)
summary(mars2)
Call: earth(formula=Sale_Price~., data=training)
coefficients
(Intercept) 319493.46
Central_AirY 20289.49
h(4-Bedroom_AbvGr) 9214.66
h(Bedroom_AbvGr-4) -23009.05
h(Year_Built-1977) 1275.57
h(2004-Year_Built) -336.64
h(Year_Built-2004) 5315.57
h(13869-Lot_Area) -2.09
h(Lot_Area-13869) 0.22
h(First_Flr_SF-1600) 104.91
h(2402-First_Flr_SF) -71.56
h(First_Flr_SF-2402) -176.61
h(1523-Second_Flr_SF) -53.13
h(Second_Flr_SF-1523) 426.63
h(Half_Bath-1) -45378.31
h(2-Fireplaces) -14408.56
h(Fireplaces-2) -26072.58
h(Garage_Area-539) 101.97
h(Garage_Area-1043) -294.30
h(Gr_Liv_Area-2049) 65.21
h(Gr_Liv_Area-3194) -159.79
Selected 21 of 24 terms, and 10 of 14 predictors
Termination condition: Reached nk 29
Importance: First_Flr_SF, Second_Flr_SF, Year_Built, Garage_Area, ...
Number of terms at each degree of interaction: 1 20 (additive model)
GCV 1033819964 RSS 2.036439e+12 GRSq 0.8338641 RSq 0.8402842
Now that all of the variables have been added in, we see a lot of them remaining in the model to predict Sale_Price. There are knot values defined for each of the variables that stay in the model. Right below the knot values in the output above we see that only 10 of the 14 original variables were used in the final model. Two lines below that we see variables listed by importance, which we will look at more below. Not surprisingly, the \(R^2\) and generalized \(R^2\) have increased with the addition of all these new variables. Notice how Garage_Area has different knot values than when we ran the algorithm on Garage_Area alone. That is because the algorithm prunes the knots with all of the variables in the model - with the other variables present, we apparently don’t need as many knots on the Garage_Area variable.
Let’s talk more about the variable importance metric in the above output. For each model size (1 term, 2 terms, etc.) there is one “subset” model - the best model for that size. EARTH ranks variables by how many of these “best” models of each size the variable appears in. The more subsets (or “best” models) a variable appears in, the more important the variable. We can get this full output using the evimp function on our earth model object.
Code
evimp(mars2)
The nsubsets value above is the number of subsets that the variable appears in. The rss value stands for residual sum of squares (or sum of squared errors) and shows a scaled version of the decrease in residual sum of squares relative to the previous subset. Since it is scaled, the top variable always has a value of 100 while the remaining ones decrease from there. The gcv value is an approximation of rss under leave-one-out cross-validation and is also scaled.
At the time these notes were written, there was not a stable version of the MARS / EARTH algorithm that worked with the latest versions of numpy and scipy. The py-earth contributed package for scikit-learn has not been updated since 2017.
Interpretability of relationships between predictor variables and the target variable starts to get more complicated with the EARTH (or MARS) algorithm. You can plot the relationship as we see above, but those relationships can still be rather complicated and hard to explain to a client.
Smoothing
Generalized additive models can be made up of any non-parametric function of the predictor variables. Another popular technique is to use smoothing functions so the piecewise linear regressions are not so jagged. The following are different types of smoothing functions:
LOESS (localized regression)
Smoothing splines & regression splines
LOESS
Locally estimated scatterplot smoothing (LOESS) is a popular smoothing technique. The idea of LOESS is to perform weighted linear regression in small windows of a scatterplot of data between two variables. This weighted linear regression is done around each point as the window moves from the low end of the scatterplot values to the high end. An example is shown below:
The predictions of these regression lines in each window are connected together to form the smoothed curve through the scatterplot, as shown above.
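As a quick sketch (not part of the original example), R’s built-in loess function fits this kind of locally weighted regression; here it is applied to the Garage_Area and Sale_Price variables from the training data, with the span argument controlling the width of the local windows:
Code
# LOESS sketch: locally weighted regression of Sale_Price on Garage_Area
loess_fit <- loess(Sale_Price ~ Garage_Area, data = training, span = 0.75)

ggplot(training, aes(x = Garage_Area, y = Sale_Price)) +
  geom_point() +
  geom_line(aes(y = predict(loess_fit)), color = "blue")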
Smoothing Splines
Smoothing splines take a different approach than LOESS. They place a knot at every single observation for the piecewise regression, which by itself would badly overfit. A penalty parameter is therefore used to counterbalance the “wiggle” of the spline curve.
Smoothing splines try to find the function \(s(x_i)\) that optimally fits \(x_i\) to the target variable through the following equation:

\[\min \sum_{i=1}^n \big(y_i - s(x_i)\big)^2 + \lambda\int s''(t)^2 \, dt\]
By thinking of \(s(x_i)\) as a prediction of \(y\), the first half of the equation is the sum of squared errors of your model. The second half applies the \(\lambda\) penalty to the integral of the squared second derivative of the smoothing function. Conceptually, the second derivative is the “slope of the slopes,” which is large when the curve has a lot of “wiggle.” The optimal value of the \(\lambda\) penalty is estimated with another approximation of leave-one-out cross-validation.
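Again as a sketch (not part of the original example), R’s built-in smooth.spline function fits this kind of penalized spline and, with cv = FALSE, chooses \(\lambda\) by generalized cross-validation:
Code
# Smoothing spline sketch: lambda chosen by generalized cross-validation (cv = FALSE)
ss_fit <- smooth.spline(x = training$Garage_Area, y = training$Sale_Price, cv = FALSE)
ss_fit$lambda   # the selected penalty value

ggplot(training, aes(x = Garage_Area, y = Sale_Price)) +
  geom_point() +
  geom_line(data = data.frame(predict(ss_fit)), aes(x = x, y = y), color = "blue")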
Regression splines are just a computationally nicer version of smoothing splines so they will not be covered in detail here.
Let’s see how to do GAM’s with splines in both R and Python!
Continuing to use our Ames housing data set, we will build a GAM using the gam function from the mgcv package in R. Similar to previous functions, the inputs are the formula for the model and the data = option to define the data set. We will also use the summary function to view the output. Inside of the formula, we use the s function to tell the gam function which variables should have splines fit to them.
Code
library(mgcv)
gam1 <- mgcv::gam(Sale_Price ~ s(Garage_Area), data = training)
summary(gam1)
From the output above we see two different sections - a section for coefficients that are not involved in splines and a section for smoothing terms. The p-value attached to the spline of Garage_Area shows the significance of that variable to the model as a whole. Similar to the EARTH algorithm, we can view a plot of the relationship between the variable and its predictions of the target. Here we use the plot function on the gam model object.
Code
plot(gam1)
This nonlinear and complex relationship between Garage_Area and Sale_Price is similar to the plot we saw earlier with EARTH. This shouldn’t be too surprising. Both algorithms are trying to relate these two variables together, just in different ways.
Let’s build out a GAM with all of the variables in our data set. The categorical variables are entered as either character variables or with the factor function. The continuous variables are defined with the spline function s.
Code
gam2 <- mgcv::gam(Sale_Price ~ s(Bedroom_AbvGr, k = 5) + s(Year_Built) + s(Mo_Sold) +
                    s(Lot_Area) + s(First_Flr_SF) + s(Second_Flr_SF) + s(Garage_Area) +
                    s(Gr_Liv_Area) + s(TotRms_AbvGrd) + Street + Central_Air +
                    factor(Fireplaces) + factor(Full_Bath) + factor(Half_Bath),
                  method = 'REML', data = training)
summary(gam2)
The top half of the output has the variables not in splines, while the bottom half has the spline variables.
There are some variables with high p-values that could be removed from the model. One of the benefits of the gam function from the mgcv package is the select option. If we set select = TRUE then the model will penalize each variable’s edf (effective degrees of freedom) value. You can think of an edf value almost like a polynomial term. The selection technique can shrink this edf value to zero - essentially zeroing out the variable itself.
An example is shown below where the Mo_Sold variable is essentially zeroed from the model.
Code
sel.gam2 <- mgcv::gam(Sale_Price ~ s(Bedroom_AbvGr, k = 5) + s(Year_Built) + s(Mo_Sold) +
                        s(Lot_Area) + s(First_Flr_SF) + s(Second_Flr_SF) + s(Garage_Area) +
                        s(Gr_Liv_Area) + s(TotRms_AbvGrd) + Street + Central_Air +
                        factor(Fireplaces) + factor(Full_Bath) + factor(Half_Bath),
                      method = 'REML', select = TRUE, data = training)
summary(sel.gam2)
Continuing to use our Ames housing data set, we will build a GAM using the GLMGam and BSplines functions from the statsmodels.gam.api package in Python. Similar to previous functions, the inputs are the formula for the model and the data = option to define the data set. We also use the smoother option to let the GLMGam function know which variables are being splined. The BSplines function used as an input to the GLMGam function is where we define the variables we want splined and to what degree we are splining them.
Code
import statsmodels as sm
from statsmodels.gam.api import GLMGam, BSplines

x_spline = training['Gr_Liv_Area']
bs = BSplines(x_spline, df = 5, degree = 3)
gam1 = GLMGam.from_formula('Sale_Price ~ C(Central_Air)', data = training, smoother = bs).fit()
gam1.summary()
From the output above, we see that the p-values attached to the spline terms show the significance of the splined variable to the model as a whole. Similar to the EARTH algorithm, we could also plot the relationship between the splined variable and its predicted effect on the target.
Let’s build out a GAM with all of the variables in our data set. The categorical variables are entered as either character variables or with the C function. The continuous variables are defined with the spline function BSplines.
Code
x_spline = training[['Gr_Liv_Area', 'Year_Built', 'Mo_Sold', 'Lot_Area', 'First_Flr_SF', 'Second_Flr_SF', 'Garage_Area', 'Gr_Liv_Area', 'TotRms_AbvGrd']]
bs = BSplines(x_spline, df = [5, 5, 5, 5, 5, 5, 5, 5, 5], degree = [3, 3, 3, 3, 3, 3, 3, 3, 3])
gam2 = GLMGam.from_formula('Sale_Price ~ C(Central_Air) + C(Fireplaces) + C(Street) + C(Full_Bath) + C(Half_Bath)', data = training, smoother = bs).fit()
gam2.summary()
The top half of the output has the variables not in splines, while the bottom half has the spline variables. There are some variables with high p-values that could be removed from the model.
Summary
In summary, GAM’s are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using GAM’s:
Allows nonlinear relationships without trying out many transformations manually
Improved predictions
Limited “interpretation” still available
Computationally fast for small numbers of variables
There are some disadvantages though:
Interactions are possible, but computationally intensive
Not good for a large number of variables, so prescreening is needed
Multicollinearity is still a problem
---title: "Generalized Additive Models"format: html: code-fold: show code-tools: trueeditor: visual---```{r}#| warning: false#| include: false#| error: false#| message: falselibrary(AmesHousing)ames <-make_ordinal_ames()library(tidyverse)ames <- ames %>%mutate(id =row_number())set.seed(4321)training <- ames %>%sample_frac(0.7)testing <-anti_join(ames, training, by ='id')training <- training %>%select(Sale_Price, Bedroom_AbvGr, Year_Built, Mo_Sold, Lot_Area, Street, Central_Air, First_Flr_SF, Second_Flr_SF, Full_Bath, Half_Bath, Fireplaces, Garage_Area, Gr_Liv_Area, TotRms_AbvGrd)``````{python}#| include: false#| warning: false#| error: false#| message: falsetraining = r.trainingtesting = r.testing```# General StructureGeneralized Additive Models (GAM's) provide a general framework for adding of non-linear functions together instead of the typical linear structure of linear regression. GAM's can be used for either continuous or categorical target variables. The structure of GAM's are the following:$$y = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) + \varepsilon$$The $f_i(x_i)$ functions are complex, nonlinear functions on the predictor variables. GAM's add these complex, yet individual functions together. This allows for many complex relationships to try and model with to potentially predict your target variable better. We will examine a few different forms of GAM's below.# Piecewise Linear RegressionThe slope of the linear relationship between a predictor variable and a target variable can change over different values of the predictor variable. The typical straight-line model $\hat{y} = \beta_0 + \beta_1x_1$ will not be a good fit for this type of data.Here is an example of a data set that exhibits this behavior - comprehensive strength of concrete and the proportion of water mixed with cement. The comprehensive strength decreases at a much faster rate for batches with a greater than 70% water/cement ratio.![Linear Regression](image/linear.png){fig-align="center" width="6in"}![Piecewise Linear Regression](image/piecewise.png){fig-align="center" width="6in"}If you were to fit a linear regression as the first image above, it wouldn't represent the data very well. However, this is perfect for a **piecewise linear regression**. Piecewise linear regression is a model where there are different straight-line relationships for different intervals in the predictor variable. The following piecewise linear regression is for two slopes:$$y = \beta_0 + \beta_1x_1 + \beta_2(x_1-k)x_2 + \varepsilon$$The $k$ value in the equation above is called the **knot value** for $x_1$. The $x_2$ variable is defined as a value of 1 when $x_1 > k$ and a value of 0 when $x_1 \le k$. With $x_2$ defined this way, when $x_1 \le k$, the equation becomes $y = \beta_0 + \beta_1x_1 + \varepsilon$. When $x1 > k$, the equation gets a new intercept and slope: $y = (\beta_0 - k\beta_2) + (\beta_1 + \beta_2)x_1 + \varepsilon$.Let's see this in each of our softwares!::: {.panel-tabset .nav-pills}## RPiecewise linear regression is built with typical linear regression functions with only some creation of new variables. In R, we can use the `lm` function on this cement data set. 
We are predicting `STRENGTH` using the `RATIO` variable as well the `X2STAR` variable which is $(x_1 - k)x_2$.```{r}#| warning: false#| error: false#| message: false#| include: falsecement <-read.csv("data/cement.csv", header =TRUE)``````{r}#| warning: false#| error: false#| message: falsecement.lm <-lm(STRENGTH ~ RATIO + X2STAR, data = cement)summary(cement.lm)ggplot(cement, aes(x = RATIO, y = STRENGTH)) +geom_point() +geom_line(data = cement, aes(x = RATIO, y = cement.lm$fitted.values)) +ylim(0,6)```We can see in the plot above how at the knot value of 70, the slope and intercept of the regression line changes.The previous example dealt with piecewise functions that are continuous - the lines stay attached. However, you could make a small adjustment to the model to make the linear discontinuous:$$y = \beta_0 + \beta_1x_1 + \beta_2(x_1 - k)x_2 + \beta_3x_2 + \varepsilon$$With the addition of the same $x_2$ variable as previously defined on its own instead of attached to the $(x_1-k)$ piece, the lines are no longer attached.```{r}#| warning: false#| error: false#| message: falsecement.lm <-lm(STRENGTH ~ RATIO + X2STAR + X2, data = cement)summary(cement.lm)qplot(RATIO, STRENGTH, group = X2, geom =c('point', 'smooth'), method ='lm', data = cement, ylim =c(0,6))```## Python```{python}#| warning: false#| error: false#| message: false#| include: falsecement = r.cement```Piecewise linear regression is built with typical linear regression functions with only some creation of new variables. In Python, we can use the `ols` function from `statsmodels.formula.api` on this cement data set. We are predicting `STRENGTH` using the `RATIO` variable as well the `X2STAR` variable which is $(x_1-k)x_2$.```{python}#| warning: false#| error: false#| message: falseimport statsmodels.formula.api as smffrom matplotlib import pyplot as pltimport seaborn as snscement_lm = smf.ols("STRENGTH ~ RATIO + X2STAR", data = cement).fit()cement_lm.summary()plt.cla()fig, ax = plt.subplots(figsize=(6, 4))sns.scatterplot(x ='RATIO', y='STRENGTH', data = cement, ax = ax)x_pred = cement['RATIO']y_pred = cement_lm.fittedvaluessns.lineplot(x = x_pred, y = y_pred, ax = ax)plt.show()```We can see in the plot above how at the knot value of 70, the slope and intercept of the regression line changes.The previous example dealt with piecewise functions that are continuous - the lines stay attached. However, you could make a small adjustment to the model to make the linear discontinuous:$$y = \beta_0 + \beta_1x_1 + \beta_2(x_1-k)x_2 + \beta_3x_2 + \varepsilon$$With the addition of the same $x_2$ variable as previously defined on its own instead of attached to the $(x_1-k)$ piece, the lines are no longer attached.```{python}#| warning: false#| error: false#| message: falsecement_lm = smf.ols("STRENGTH ~ RATIO + X2STAR + X2", data = cement).fit()cement_lm.summary()plt.cla()fig, ax = plt.subplots(figsize=(6, 4))sns.scatterplot(x ='RATIO', y='STRENGTH', data = cement, ax = ax)x_pred = cement['RATIO']y_pred = cement_lm.fittedvaluessns.lineplot(x = x_pred, y = y_pred, ax = ax)plt.show()```Although the plot above looks like the line pieces are attached, that is just the visualization of the plot itself. There is no prediction between the two lines.:::The piecewise linear regression equation can be extended to have as many pieces as you want. 
An example with three lines (two knots) is as follows:$$y = \beta_0 +\beta_1x_1 + \beta_2(x_1-k_1)x_2 + \beta_3(x_1-k_2)x_3 + \varepsilon$$One of the problems with this structure is that we have to define the knot values ourselves. The next set of models can help do that for us!# MARS (and EARTH)Multivariate adaptive regression splines (MARS) is a non-parametric technique that still has a linear form to the model (additive) but has nonlinearities and interaction between variables. Essentially, MARS uses piecewise regression approach to split into pieces then potentially uses either linear or nonlinear patterns for each piece.MARS first looks for the point in the range of a predictor $x_i$ where two linear functions on either side of the point provides the least squared error (linear regression).```{r}#| echo: false#| warning: false#| error: false#| message: falsecement.lm <-lm(STRENGTH ~ RATIO + X2STAR, data = cement)ggplot(cement, aes(x = RATIO, y = STRENGTH)) +geom_point() +geom_line(data = cement, aes(x = RATIO, y = cement.lm$fitted.values)) +ylim(0,6)```The algorithm continues on each piece of the piecewise function until many knots are found.![Example of Growing Number of Knots](image/knots.png){fig-align="center" width="6in"}This will eventually overfit your data. However, the algorithm then works backwards to "prune" (or remove) the knots that do not contribute significantly to out of sample accuracy. This out of sample accuracy calculation is performed by using the **generalized cross-validation** (GCV) procedure - a computational short-cut for leave-one-out cross-validation. The algorithm does this for all of the variables in the data set and combines the outcomes together.The actual MARS algorithm is trademarked by Salford Systems, so instead the common implementation in most softwares is **enhanced adaptive regression through hinges** - called EARTH.Let's see how to do this in each of our softwares!::: {.panel-tabset .nav-pills}## RLet's go back to our Ames housing data set and the variables we were working with in the previous section. One of the variables in our data set is `Garage_Area`. It doesn't have a straight-forward relationship with our target variable `Sale_Price` as seen by the plot below.```{r}#| warning: false#| error: false#| message: falseggplot(training, aes(x = Garage_Area, y = Sale_Price)) +geom_point()```Let's fit the EARTH algorithm between `Garage_Area` and `Sale_Price`. In the `earth` package is the `earth` function. The input is similar to most modeling functions in R, a formula to relate predictor variables to a target variable and an option to define the data set being used. We will then look at a `summary` of the output.```{r}#| warning: false#| error: false#| message: falselibrary(earth)mars1 <-earth(Sale_Price ~ Garage_Area, data = training)summary(mars1)```From the output above we see 6 pieces of the function defined by 5 knots. Those five knots correspond to the `Garage_Area - ___` values above. The coefficients attached to each of those pieces are the same as what we would have in piecewise linear regression. 
The bottom of the output also shows the generalized $R^2$ value as well as the typical $R^2$ value.To visualize the piecewise relationship between `Garage_Area` and `Sale_Price` we can plot the predicted values on the scatterplot from above.```{r}#| warning: false#| error: false#| message: falseggplot(training, aes(x = Garage_Area, y = Sale_Price)) +geom_point() +geom_line(data = training, aes(x = Garage_Area, y = mars1$fitted.values), color ="blue")```We can see that the `Sale_Price` of the home stays relative steady for small values of `Garage_Area` but then increases to a point, before it begins to level off again.Now let's build the algorithm on all the variables in the data set that we have. The `Sale_Price ~ .` notation tells the `earth` function to use all the variables in the data set to predict the `Sale_Price`.```{r}#| warning: false#| error: false#| message: falsemars2 <-earth(Sale_Price ~ ., data = training)summary(mars2)```Now that all of the variables have been added in, we see a lot of them remaining in the model to predict `Sale_Price`. There are knot values defined for all of the variables that are in the model. Right below the knot values in the output above we see that only 10 of the 14 original variables were used in the final model. Two lines below that we see variables listed by importance. We will look at this more below. Not surprisingly, the $R^2$ and generalized $R^2$ has increased with the addition of all these new variables. Notice how `Garage_Area` has different knot values then when we ran the algorithm on `Garage_Area` alone. That is because the algorithm prunes the knots with all of the variables in the model. Apparently, some of the other variables being in the model means we don't need as many knots in the `Garage_Area` variable.Let's talk more about that variable importance metric in the above output. For each model size (1 term, 2 terms, etc.) there is one "subset" model - the best model for that size. EARTH ranks variables by how many of these "best models" of each size that variable appears in. The more subsets (or "best models") that a variable appears in, the more important the variable. We can get this full output using the `evimp` function on our `earth` model object.```{r}#| warning: false#| error: false#| message: falseevimp(mars2)```The `nsubsets` above is the number of subsets that the variable appears in. The `rss` above stands for residual sum of squares (or sum of squares error) is a scaled version of the decrease in residual sum of squares relative to the previous subset. Since it is scaled, the top variable always has a value of 100 while the remaining ones decrease from there. The `gcv` value is an approximation of `rss` on leave-one-out cross-validation and is also scaled.## Python```{python}#| warning: false#| error: false#| message: false```At the writing of these notes, there was not a stable version of the MARS / EARTH algorithm that worked in the latest versions of `numpy` and `scipy`. The `py-earth` contributed package for `scikit-learn` has not been updated since 2017.:::Interpretability of relationships between predictor variables and the target variable starts to get more complicated with the EARTH (or MARS) algorithm. You can plot the relationship as we see above, but those relationships can still be rather complicated and hard to explain to a client.# SmoothingGeneralized additive models can be made up of any non-parametric function of the predictor variables. 
Another popular technique is to use **smoothing functions** so the piecewise linear regressions are not so jagged. The following are different types of smoothing functions:- LOESS (localized regression)- Smoothing splines & regression splines## LOESSLocally estimated scatterplot smoothing (LOESS) is a popular smoothing technique. The idea of LOESS is to perform weighted linear regression in small windows of a scatterplot of data between two variables. This weighted linear regression is done around each point as the window moves from the low end of the scatterplot values to the high end. An example is shown below:![LOESS Regression](image/LOESS.png){fig-align="center" width="8in"}The predictions of the these regression lines in each window are connected together to form the smoothed curve through the scatterplot as shown above.## Smoothing SplinesSmoothing splines take a different approach as compared to LOESS. Smoothing splines have a knot at every single observation for piecewise regression which leads to overfitting. There is a penalty parameter used to counterbalance the "wiggle" of the spline curve.Smoothing splines try to find the function $s(x_i)$ that optimally fits $x_i$ to the target variable through the following equation:$$\min\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda\int s''(t_i)^2 dt$$By thinking of $s(x_i)$ as a prediction of $y$, the front half of the equation is equal to the sum of squared errors in your model. The second half of the equation above has the $\lambda$ penalty applied to the integral of the second derivative of the smoothing function. To conceptually think of this second derivative we can think of it as the "slope of slopes" which is large when the curve has a lot of "wiggle". The optimal value of the $\lambda$ penalty is estimated with another approximation of leave-one-out cross-validation.Regression splines are just a computationally nicer version of smoothing splines so they will not be covered in detail here.Let's see how to do GAM's with splines in each of our softwares!::: {.panel-tabset .nav-pills}## RContinuing to use our Ames housing data set, we will build a `gam` using the `mgcv` package in R. Similar to previous functions the inputs are the formula for the model and the `data =` option to define the data set. We will also use the `summary` function to view the output. Inside of the formula, we use the `s` function to inform the `gam` function to which variables should have splines fit to them.```{r}#| warning: false#| error: false#| message: falselibrary(mgcv)gam1 <- mgcv::gam(Sale_Price ~s(Garage_Area), data = training)summary(gam1)```From the output above we see two different sections - a section for coefficients that are not involved in splines and a section for smoothing terms. The p-value attached to the spline of `Garage_Area` shows the significance of that variable to the model as a whole. Similar to the EARTH algorithm, we can view a plot of the relationship between the variable and its predictions of the target. Here we use the `plot` function on the `gam` model object.```{r}#| warning: false#| error: false#| message: falseplot(gam1)```This nonlinear and complex relationship between `Garage_Area` and `Sale_Price` is similar to the plot we saw earlier with EARTH. This shouldn't be too surprising. Both algorithms are trying to relate these two variables together, just in different ways.Let's build out a GAM with all of the variables in our data set. 
The categorical variables are entered as either character variables or with the `factor` function. The continuous variables are defined with the spline function `s`.```{r}#| warning: false#| error: false#| message: falsegam2 <- mgcv::gam(Sale_Price ~s(Bedroom_AbvGr, k =5) +s(Year_Built) +s(Mo_Sold) +s(Lot_Area) +s(First_Flr_SF) +s(Second_Flr_SF) +s(Garage_Area) +s(Gr_Liv_Area) +s(TotRms_AbvGrd) + Street + Central_Air +factor(Fireplaces) +factor(Full_Bath) +factor(Half_Bath) , method ='REML', data = training)summary(gam2)```The top half of the output has the variables not in splines, while the bottom half has the spline variables.There are some variables with high p-values that could be removed from the model. One of the benefits of the `gam` function from the `mgcv` package is the `select` option. If we set `select = TRUE` then the model will penalize variables' edf values. You can think of an edf value almost like a polynomial term. The selection technique will zero out this edf value - essentially, zeroing out the variable itself.An example is shown below where the `Mo_Sold` variable is essentially zeroed from the model.```{r}#| warning: false#| error: false#| message: falsesel.gam2 <- mgcv::gam(Sale_Price ~s(Bedroom_AbvGr, k =5) +s(Year_Built) +s(Mo_Sold) +s(Lot_Area) +s(First_Flr_SF) +s(Second_Flr_SF) +s(Garage_Area) +s(Gr_Liv_Area) +s(TotRms_AbvGrd) + Street + Central_Air +factor(Fireplaces) +factor(Full_Bath) +factor(Half_Bath) , method ='REML', select =TRUE, data = training)summary(sel.gam2)```## PythonContinuing to use our Ames housing data set, we will build a GAM using the `GLMGam` and `BSplines` functions from the `stasmodels.gam.api` package in Python. Similar to previous functions the inputs are the formula for the model and the `data =` option to define the data set. We will also use the `smoother` option to let the `GLMGam` function know which variables are being splined. The `BSplines` function used as an input to the `GLMGam` function is where we define the variables we want splined and to what degree we are splining them.```{python}#| warning: false#| error: false#| message: falseimport statsmodels as smfrom statsmodels.gam.api import GLMGam, BSplinesx_spline = training['Gr_Liv_Area']bs = BSplines(x_spline, df =5, degree =3)gam1 = GLMGam.from_formula('Sale_Price ~ C(Central_Air)', data = training, smoother = bs).fit()gam1.summary()```From the output above we see the p-values attached to the spline values of `Garage_Area` shows the significance of that variable to the model as a whole. Similar to the EARTH algorithm, we can view a plot of the relationship between the variable and its predictions of the target. Here we use the `plot` function on the `gam` model object.Let's build out a GAM with all of the variables in our data set. The categorical variables are entered as either character variables or with the `C` function. The continuous variables are defined with the spline function `BSplines`.```{python}#| warning: false#| error: false#| message: falsex_spline = training[['Gr_Liv_Area', 'Year_Built','Mo_Sold','Lot_Area','First_Flr_SF','Second_Flr_SF','Garage_Area','Gr_Liv_Area','TotRms_AbvGrd']]bs = BSplines(x_spline, df = [5, 5, 5, 5, 5, 5, 5, 5, 5], degree = [3, 3, 3, 3, 3, 3, 3, 3, 3])gam2 = GLMGam.from_formula('Sale_Price ~ C(Central_Air) + C(Fireplaces) + C(Street) + C(Full_Bath) + C(Half_Bath)', data = training, smoother = bs).fit()gam2.summary()```The top half of the output has the variables not in splines, while the bottom half has the spline variables. 
There are some variables with high p-values that could be removed from the model.:::# SummaryIn summary, GAM's are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using GAM's:- Allows nonlinear relationships without trying out many transformations manually- Improved predictions- Limited "interpretation" still available- Computationally fast for small numbers of variablesThere are some disadvantages though:- Interactions are possible, but computationally intensive- Not good for large number of variables so prescreening is needed- Multicollinearity still a problem.