Categorical Data Analysis

Exploratory Data Analysis

First, we need to explore our data before building any models to try and explain/predict our categorical target variable. With categorical variables, we can look at the distribution of the categories as well as see if this distribution has any association with other variables. For this analysis we are going to use the popular Ames housing dataset. This dataset contains information on home values for a sample of nearly 3,000 houses in Ames, Iowa in the early 2000s. To access this data, we first load the AmesHousing package and create the nicely formatted data with the make_ordinal_ames() function.

Code
library(AmesHousing)

ames <- make_ordinal_ames()

Imagine you worked for a real estate agency and got a bonus check if you sold a house above $175,000 in value. Let’s create this variable in our data:

Code
library(tidyverse)

ames <- ames %>%
  mutate(Bonus = ifelse(Sale_Price > 175000, 1, 0))

Before exploring any relationships between predictor variables and the target variable Bonus, we need to split our dataset into training and testing pieces. Because models are prone to discovering small, spurious patterns in the data used to create them (the training data), we set aside validation and/or testing data to get a clear view of how the models might perform on new data they have never seen before.

Code
set.seed(123)  # for reproducibility

ames <- ames %>% mutate(id = row_number())  # row id to track each observation

train <- ames %>% sample_frac(0.7)  # random 70% of rows for training

test <- anti_join(ames, train, by = 'id')  # remaining 30% for testing

You are interested in what variables might be associated with obtaining a higher chance of getting a bonus (selling a house above $175,000). An association exists between two categorical variables if the distribution of one variable changes when the value of the other categorical variable changes. If there is no association, the distribution of the first variable is the same regardless of the value of the other variable. For example, if we wanted to know if obtaining a bonus on selling a house in Ames, Iowa was associated with whether the house had central air, we could look at the distribution of bonus eligible houses within each level of central air. If we observe that 42% of homes with central air are bonus eligible and 42% of homes without central air are bonus eligible, then it appears that central air has no bearing on whether the home is bonus eligible. However, if instead we observe that only 3% of homes without central air are bonus eligible, but 44% of homes with central air are bonus eligible, then it appears that having central air might be related to a home being bonus eligible.

To understand the distribution of categorical variables we need to look at frequency tables. A frequency table shows the number of observations that occur in certain categories or intervals. A one-way frequency table examines all the categories of one variable. These are easily visualized with bar charts.

Let’s see how to do this in each of our softwares!
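As a minimal sketch in R, assuming the train data created above, a one-way frequency table of Bonus and a matching bar chart could look like the following; table() and prop.table() are base R, and ggplot2 is already loaded with the tidyverse:

Code
# One-way frequency table of the target variable
table(train$Bonus)

# Same table expressed as proportions
prop.table(table(train$Bonus))

# Bar chart of the distribution of Bonus
ggplot(train, aes(x = factor(Bonus))) +
  geom_bar() +
  labs(x = "Bonus Eligible (1 = Yes)", y = "Count")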

Tests of Association

We have statistical tests to evaluate relationships between two categorical variables. The null hypothesis for these statistical tests is that the two variables have no association; that is, the distribution of one variable does not change across levels of the other variable. The alternative hypothesis is that there is an association between the two variables; that is, the distribution of one variable changes across levels of the other variable.

These statistical tests follow a \(\chi^2\)-distribution. The \(\chi^2\)-distribution is a distribution that has the following characteristics:

  • Bounded below by 0

  • Right skewed

  • Defined by a single parameter, its degrees of freedom

A plot of a variety of \(\chi^2\)-distributions is shown here:
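As a quick sketch, a few of these densities can be drawn with base R's dchisq() to show the lower bound at 0, the right skew, and the effect of the degrees of freedom:

Code
# Chi-square densities for several degrees of freedom
x <- seq(0.01, 20, length.out = 500)

plot(x, dchisq(x, df = 1), type = "l", ylim = c(0, 0.5),
     xlab = "Value", ylab = "Density")
lines(x, dchisq(x, df = 3), lty = 2)
lines(x, dchisq(x, df = 8), lty = 3)
legend("topright", legend = c("df = 1", "df = 3", "df = 8"), lty = 1:3)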

Two common \(\chi^2\) tests are the Pearson and Likelihood Ratio \(\chi^2\) tests. They compare the observed count of observations in each cell of a cross-tabulation table to the count we would expect if there were no relationship. The expected cell count applies the overall distribution of one variable across all the levels of the other variable. For example, overall 59% of all homes are not bonus eligible. If that were to apply to every level of central air, then of the 147 homes without central air we would expect 86.8 ( \(= 147 \times 1211/2051\) ) of them to not be bonus eligible and 60.2 ( \(= 147 \times 840/2051\) ) of them to be bonus eligible. We actually observe 142 and 5 homes in these two categories respectively. The further the observed data is from the expected data, the more evidence we have that there is a relationship between the two variables.

The test statistic for the Pearson \(\chi^2\) test is the following:

\[ \chi^2_P = \sum_{i=1}^R \sum_{j=1}^C \frac{(Obs_{i,j} - Exp_{i,j})^2}{Exp_{i,j}} \]

From the equation above, the closer the observed count in each cell of the cross-tabulation table (across all rows and columns) is to the expected count, the smaller the test statistic. As with all hypothesis tests, the smaller the test statistic, the larger the p-value, implying less evidence for the alternative hypothesis.

Another common test is the Likelihood Ratio test. The test statistic for this is the following:

\[ \chi^2_L = 2 \times \sum_{i=1}^R \sum_{j=1}^C Obs_{i,j} \times \log(\frac{Obs_{i,j}}{Exp_{i,j}}) \]

The p-value comes from a \(\chi^2\)-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one. Both of the above tests have a sample size requirement: 80% or more of the cells in the cross-tabulation table need expected counts larger than 5.
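As a minimal sketch in R (assuming the train split from earlier), the Pearson test is available through base R's chisq.test(), and the likelihood ratio statistic can be computed directly from the formula above using the expected counts that chisq.test() returns:

Code
obs <- table(train$Central_Air, train$Bonus)

# Pearson chi-square test; correct = FALSE turns off Yates' continuity correction
pearson <- chisq.test(obs, correct = FALSE)
pearson

# Likelihood ratio chi-square computed from the formula above
expected <- pearson$expected
lr_stat  <- 2 * sum(obs * log(obs / expected))
pchisq(lr_stat, df = (nrow(obs) - 1) * (ncol(obs) - 1), lower.tail = FALSE)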

For smaller samples, this requirement might be hard to meet. In those situations, we can use a more computationally expensive test called Fisher's exact test, which evaluates every possible permutation of the data to calculate the p-value without any distributional assumptions.
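In R, a quick sketch of this test on the same cross-tabulation uses base R's fisher.test():

Code
# Fisher's exact test on the central air by bonus table
fisher.test(table(train$Central_Air, train$Bonus))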

Both the Pearson and Likelihood Ratio \(\chi^2\) tests can handle any type of categorical variable (ordinal, nominal, or both). However, ordinal variables provide us extra information since the order of their categories actually matters, unlike with nominal categories. When both variables are ordinal, we can test for something more specific: whether the two variables have a linear association rather than just a general one. An ordinal test for association is the Mantel-Haenszel \(\chi^2\) test. The test statistic for the Mantel-Haenszel \(\chi^2\) test is the following:

\[ \chi^2_{MH} = (n-1)r^2 \] where \(r\) is the Pearson correlation between the column and row variables. This test follows a \(\chi^2\)-distribution with only one degree of freedom.
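A minimal sketch of this calculation in R, assuming the variables are coded numerically in category order (here using the factor codes of Central_Air and the 0/1 Bonus flag):

Code
# Mantel-Haenszel chi-square computed as (n - 1) * r^2
x <- as.numeric(train$Central_Air)  # ordinal coding of the row variable
y <- train$Bonus                    # already coded 0/1

r  <- cor(x, y)                     # Pearson correlation of the coded values
mh <- (length(x) - 1) * r^2

pchisq(mh, df = 1, lower.tail = FALSE)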

Let’s see how to do each of these tests in each of our softwares!

Measures of Association

Tests of association are best designed for just that: testing the existence of an association between two categorical variables. However, hypothesis tests are impacted by sample size. When two tests are based on the same sample size (and the same degrees of freedom), their p-values can be used to rank the significance of variables. When sample sizes or degrees of freedom differ between two tests, however, tests of association are not best for comparing the strength of an association. In those scenarios, we have measures of strength of association that can be compared across any sample size.

Measures of association were not designed to test if an association exists, as that is what statistical testing is for. They are designed to measure the strength of association. There are dozens of these measures. Three of the most common are the following:

  • Odds Ratios (only for comparing two binary variables)

  • Cramer’s V (able to compare nominal variables with any number of categories)

  • Spearman’s Correlation (able to compare ordinal variables with any number of categories)

An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group. The odds of an event occurring are not the same as the probability that the event occurs. The odds of an event occurring are the probability the event occurs divided by the probability that the event does not occur.

\[ Odds = \frac{p}{1-p} \]

Let’s again examine the cross-tabulation table between central air and bonus eligibility.
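The table below follows the layout of the CrossTable() function from the gmodels package; a sketch of a call that would produce it (assuming gmodels is installed):

Code
library(gmodels)

# Cross-tabulation of central air (rows) by bonus eligibility (columns)
CrossTable(train$Central_Air, train$Bonus)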


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2051 

 
                  | train$Bonus 
train$Central_Air |         0 |         1 | Row Total | 
------------------|-----------|-----------|-----------|
                N |       142 |         5 |       147 | 
                  |    35.112 |    50.620 |           | 
                  |     0.966 |     0.034 |     0.072 | 
                  |     0.117 |     0.006 |           | 
                  |     0.069 |     0.002 |           | 
------------------|-----------|-----------|-----------|
                Y |      1069 |       835 |      1904 | 
                  |     2.711 |     3.908 |           | 
                  |     0.561 |     0.439 |     0.928 | 
                  |     0.883 |     0.994 |           | 
                  |     0.521 |     0.407 |           | 
------------------|-----------|-----------|-----------|
     Column Total |      1211 |       840 |      2051 | 
                  |     0.590 |     0.410 |           | 
------------------|-----------|-----------|-----------|

 

Let’s look at the row without central air. The probability that a home without central air is not bonus eligible is 96.6%. That implies that the odds of not being bonus eligible in homes without central air is 28.41 (= 0.966/0.034). For homes with central air, the odds of not being bonus eligible are 1.28 (= 0.561/0.439). The odds ratio between these two would be approximately 22.2 (= 28.41/1.28). In other words, homes without central air are 22.2 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air. This relationship is intuitive based on the numbers we have seen. Without going into details, it can also be shown that homes with central air are 22.2 times as likely (in terms of odds) to be bonus eligible.
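A quick sketch of that arithmetic, using the counts from the cross-tabulation above:

Code
# Probabilities of not being bonus eligible within each level of central air
p_no_air <- 142 / 147    # no central air: ~0.966
p_air    <- 1069 / 1904  # central air:    ~0.561

# Convert probabilities to odds: p / (1 - p)
odds_no_air <- p_no_air / (1 - p_no_air)  # ~28.4
odds_air    <- p_air / (1 - p_air)        # ~1.28

# Odds ratio comparing the two groups
odds_no_air / odds_air                    # ~22.2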

Cramer’s V is another measure of strength of association. Cramer’s V is calculated as follows:

\[ V = \sqrt{\frac{\chi^2_P/n}{\min(Rows-1, Columns-1)}} \]

Cramer’s V is bounded between 0 and 1 for every comparison other than two binary variables. For two binary variables being compared the bounds are -1 to 1. The idea is still the same for both. The further the value is from 0, the stronger the relationship. Unfortunately, unlike \(R^2\), Cramer’s V has no interpretative value. It can only be used for comparison.
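A minimal R sketch computing Cramer’s V directly from the formula, using the Pearson statistic from chisq.test() (packages such as DescTools also provide a ready-made function, if installed):

Code
obs  <- table(train$Central_Air, train$Bonus)
chi2 <- chisq.test(obs, correct = FALSE)$statistic

# Cramer's V from the formula above
sqrt((chi2 / sum(obs)) / min(nrow(obs) - 1, ncol(obs) - 1))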

Lastly, we have Spearman’s correlation. Much like the Mantel-Haenszel test of association was specifically designed for comparing two ordinal variables, Spearman’s correlation measures the strength of association between two ordinal variables. Spearman’s correlation is not limited to categorical data analysis, as it is also used for detecting heteroskedasticity in linear regression. Remember, Spearman’s correlation is a correlation on the ranks of the observations as compared to the actual values of the observations.
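A sketch in base R, coding each variable numerically and asking cor() for the Spearman method:

Code
# Spearman correlation: Pearson correlation computed on the ranks
cor(as.numeric(train$Central_Air), train$Bonus, method = "spearman")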

As previously mentioned, these are only a few of the dozens of different measures of association that exist. However, they are among the most commonly used.

Let’s see how to do this in each of our softwares!