Generalized Additive Models

General Structure

Generalized Additive Models (GAMs) provide a general framework for adding non-linear functions together instead of the typical linear structure of linear regression. GAMs can be used for either continuous or categorical target variables. The structure of a GAM is the following:

\[ y = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) + \varepsilon \]

The \(f_i(x_i)\) functions are complex, nonlinear functions of the predictor variables. GAMs add these complex, yet individual, functions together. This allows many complex relationships to be modeled, which can potentially predict your target variable better. We will examine a few different forms of GAMs below.

Piecewise Linear Regression

The slope of the linear relationship between a predictor variable and a target variable can change over different values of the predictor variable. The typical straight-line model \(\hat{y} = \beta_0 + \beta_1x_1\) will not be a good fit for this type of data.

Here is an example of a data set that exhibits this behavior - the compressive strength of concrete and the proportion of water mixed with cement. The compressive strength decreases at a much faster rate for batches with greater than a 70% water/cement ratio.

Figure: Linear regression fit (first image) vs. piecewise linear regression fit (second image) of the concrete strength data.

If you were to fit a linear regression like the first image above, it wouldn’t represent the data very well. However, this data is perfect for a piecewise linear regression. Piecewise linear regression is a model with different straight-line relationships over different intervals of the predictor variable. The following piecewise linear regression equation has two slopes:

\[ y = \beta_0 + \beta_1x_1 + \beta_2(x_1-k)x_2 + \varepsilon \]

The \(k\) value in the equation above is called the knot value for \(x_1\). The \(x_2\) variable is defined as a value of 1 when \(x_1 > k\) and a value of 0 when \(x_1 \le k\). With \(x_2\) defined this way, when \(x_1 \le k\), the equation becomes \(y = \beta_0 + \beta_1x_1 + \varepsilon\). When \(x_1 > k\), the equation gets a new intercept and slope: \(y = (\beta_0 - k\beta_2) + (\beta_1 + \beta_2)x_1 + \varepsilon\).

Let’s see this in each of our software packages!
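As one way to see the mechanics (a minimal sketch, not the code from any particular software package), the two-slope model can be fit in Python by building the indicator and hinge terms by hand and running ordinary least squares with numpy and statsmodels. The data values below are made up purely for illustration, and the knot is fixed at \(k = 70\) to mirror the water/cement example.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: water/cement ratio (x1) and compressive strength (y)
x1 = np.array([45, 50, 55, 60, 65, 70, 75, 80, 85, 90], dtype=float)
y = np.array([5.6, 5.4, 5.2, 5.0, 4.8, 4.6, 3.9, 3.2, 2.5, 1.8])

k = 70.0                      # knot value for x1
x2 = (x1 > k).astype(float)   # indicator: 1 when x1 > k, 0 when x1 <= k

# Design matrix: intercept, x1, and the hinge term (x1 - k) * x2
X = sm.add_constant(np.column_stack([x1, (x1 - k) * x2]))

fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates of beta_0, beta_1, beta_2
```

The fitted \(\beta_1\) is the slope below the knot, and \(\beta_1 + \beta_2\) is the slope above it, matching the algebra above.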

The piecewise linear regression equation can be extended to have as many pieces as you want. An example with three lines (two knots) is as follows:

\[ y = \beta_0 +\beta_1x_1 + \beta_2(x_1-k_1)x_2 + \beta_3(x_1-k_2)x_3 + \varepsilon \]

Here \(x_2\) and \(x_3\) are indicator variables for \(x_1 > k_1\) and \(x_1 > k_2\), respectively. One of the problems with this structure is that we have to define the knot values ourselves. The next set of models can help do that for us!

MARS (and EARTH)

Multivariate adaptive regression splines (MARS) is a non-parametric technique that still has a linear (additive) form to the model but allows nonlinearities and interactions between variables. Essentially, MARS uses a piecewise regression approach to split the range of each predictor into pieces and then fits either linear or nonlinear patterns within each piece.

MARS first looks for the point in the range of a predictor \(x_i\) where fitting two linear functions, one on either side of the point, provides the smallest squared error (linear regression).

The algorithm continues on each piece of the piecewise function until many knots are found.

Figure: Example of a growing number of knots.

This will eventually overfit your data. However, the algorithm then works backwards to “prune” (or remove) the knots that do not contribute significantly to out-of-sample accuracy. This out-of-sample accuracy calculation is performed using the generalized cross-validation (GCV) procedure - a computational shortcut for leave-one-out cross-validation. The algorithm does this for all of the variables in the data set and combines the outcomes together.

The actual MARS algorithm is trademarked by Salford Systems, so the common implementation in most software packages is instead enhanced adaptive regression through hinges - called EARTH.

Let’s see how to do this in each of our software packages!
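As a minimal sketch, the snippet below uses py-earth (the pyearth Python package), one open-source implementation of the EARTH approach; it assumes that package is installed and uses simulated data purely to show the interface.

```python
import numpy as np
from pyearth import Earth   # EARTH implementation from the py-earth package

rng = np.random.default_rng(0)

# Simulated data: one predictor whose slope changes past x = 70
x = rng.uniform(0, 100, size=200)
y = np.where(x > 70, 5 - 0.1 * (x - 70), 5.0) + rng.normal(0, 0.2, size=200)

# max_degree=1 keeps the model additive (no interaction terms);
# the backward pass prunes hinge functions using the GCV criterion
model = Earth(max_degree=1)
model.fit(x.reshape(-1, 1), y)

print(model.summary())                  # selected hinge basis functions
preds = model.predict(x.reshape(-1, 1))
```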

Interpretability of the relationships between predictor variables and the target variable starts to get more complicated with the EARTH (or MARS) algorithm. You can plot the relationships as we see above, but they can still be rather complicated and hard to explain to a client.

Smoothing

Generalized additive models can be made up of any non-parametric function of the predictor variables. Another popular technique is to use smoothing functions so the piecewise linear regressions are not so jagged. The following are different types of smoothing functions:

  • LOESS (localized regression)

  • Smoothing splines & regression splines

LOESS

Locally estimated scatterplot smoothing (LOESS) is a popular smoothing technique. The idea of LOESS is to perform weighted linear regression in small windows of a scatterplot of data between two variables. This weighted linear regression is done around each point as the window moves from the low end of the scatterplot values to the high end. An example is shown below:

Figure: LOESS regression.

The predictions of these regression lines in each window are connected together to form the smoothed curve through the scatterplot as shown above.
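For a rough sketch of the idea, the lowess function in the Python statsmodels package performs this kind of locally weighted regression; the data below are simulated purely for illustration, and the frac argument plays the role of the window size.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)

# Simulated scatterplot data with a nonlinear trend
x = np.sort(rng.uniform(0, 10, size=150))
y = np.sin(x) + rng.normal(0, 0.3, size=150)

# frac is the fraction of points used in each local weighted regression
# (i.e., the width of the moving window)
smoothed = lowess(y, x, frac=0.3)   # returns an array of (x, fitted) pairs

x_smooth, y_smooth = smoothed[:, 0], smoothed[:, 1]
```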

Smoothing Splines

Smoothing splines take a different approach compared to LOESS. Smoothing splines place a knot at every single observation for the piecewise regression, which by itself would lead to overfitting. A penalty parameter is therefore used to counterbalance the “wiggle” of the spline curve.

Smoothing splines try to find the function \(s(x_i)\) that optimally fits \(x_i\) to the target variable through the following equation:

\[ \min_s \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda\int s''(t)^2 \, dt \]

By thinking of \(s(x_i)\) as a prediction of \(y\), the first half of the equation is the sum of squared errors of your model. The second half of the equation has the \(\lambda\) penalty applied to the integral of the squared second derivative of the smoothing function. Conceptually, the second derivative is the “slope of the slopes,” which is large when the curve has a lot of “wiggle.” The optimal value of the \(\lambda\) penalty is estimated with another approximation of leave-one-out cross-validation.
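As a minimal single-variable sketch, SciPy (version 1.10 or later) provides make_smoothing_spline, which fits a cubic smoothing spline with a penalty of this form; the data below are simulated, and leaving the penalty unspecified lets SciPy pick it with a GCV-style criterion.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline   # SciPy >= 1.10

rng = np.random.default_rng(2)

# Simulated data with a smooth nonlinear signal plus noise
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.3, size=100)

# lam is the penalty on the integrated squared second derivative s''(t)^2;
# leaving lam=None lets SciPy choose it via a GCV-style criterion
spline = make_smoothing_spline(x, y, lam=None)

fitted = spline(x)   # the estimated s(x_i) values
```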

Regression splines are just a computationally nicer version of smoothing splines, so they will not be covered in detail here.

Let’s see how to do GAMs with splines in each of our software packages!
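As one hedged example (assuming the Python pygam package is available), the sketch below fits a GAM with one smoothing spline term per predictor on simulated data; the gridsearch call selects the smoothing penalties.

```python
import numpy as np
from pygam import LinearGAM, s   # spline terms via the pygam package

rng = np.random.default_rng(3)

# Simulated data: two predictors with different nonlinear effects
n = 300
X = rng.uniform(0, 10, size=(n, 2))
y = np.sin(X[:, 0]) + 0.1 * (X[:, 1] - 5) ** 2 + rng.normal(0, 0.3, size=n)

# One smoothing spline term per predictor; gridsearch() tries a grid of
# smoothing penalties (lam) and keeps the best by its default criterion
gam = LinearGAM(s(0) + s(1)).gridsearch(X, y)

gam.summary()           # term-by-term fit and smoothing information
preds = gam.predict(X)
```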

Summary

In summary, GAMs are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using GAMs:

  • Allows nonlinear relationships without trying out many transformations manually

  • Improved predictions

  • Limited “interpretation” still available

  • Computationally fast for small numbers of variables

There are some disadvantages though:

  • Interactions are possible, but computationally intensive

  • Not good for a large number of variables, so prescreening is needed

  • Multicollinearity is still a problem