Introduction to Simulation
Most analysis assumes a rather unrealistic view of the real world - certainty. In a real world setting - especially the business world - the inputs and coefficients in a problem are rarely fixed quantities. One possible solution to this is sensitivity analysis (like reduced costs and shadow prices) in optimization. However, a very popular approach to handling this uncertainty and evaluating its effects is simulation.
Simulations help us determine not only the full array of outcomes of a given decision, but the probabilities of these outcomes occurring. These are extremely useful in many fields. Simulations - like target shuffling - are used to evaluate how likely it is that your model's results occurred due to random chance. The field of risk management tries to simulate rare outcomes and measure their impact. This code deck will only cover the concept of simulation in general as well as some basic examples. For a full introduction to risk management please see the Introduction to Risk Management code deck.
What is the general idea behind simulation analysis? A fixed (or constant) input to a deterministic model will lead to a fixed output. Instead, in a simulation analysis, each input of a model (or process) is assigned a range of possible values - the probability distribution of the inputs. We then analyze what happens to the decision from our model (or process) under all these possible scenarios. Simulation analysis describes not only the outcomes of certain decisions, but also the probability distribution of those outcomes. This helps us determine the probability that each of these outcomes occurs. For example, assume a stock price is $100 today. Next, let's assume that it follows a random walk (random up or down movement centered at 0) each day for the next 100 days. This random walk follows a Gaussian (normal) distribution with a mean of 0 and standard deviation of 1. Let's simulate some potential paths for this stock and compare them to the actual path.
After the simulation is complete, the analysis turns to the probability distribution of the outcomes. We can calculate probabilities of outcomes or describe characteristics of this new distribution such as the mean, median, variance, skewness, percentiles, etc. From our example above we have a distribution of possible prices that the stock can take after 100 days. Of all those possible values, only 27.2% of them had a lower stock price.
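As a rough sketch of the random walk example above (the seed, the number of simulated paths, and the reference price used in the final comparison are arbitrary choices for illustration), the following R code simulates price paths and computes the share of simulated day-100 prices that fall below a chosen reference price:

set.seed(112358)

n.days <- 100
n.paths <- 10000

# Each column is one simulated path: $100 plus a cumulative sum of N(0,1) daily moves
moves <- matrix(rnorm(n.days*n.paths, mean=0, sd=1), nrow=n.days, ncol=n.paths)
paths <- 100 + apply(moves, 2, cumsum)

# Plot a handful of the simulated paths
matplot(paths[, 1:20], type='l', lty=1, main='Simulated Stock Price Paths', xlab='Day', ylab='Price')

# Share of simulated day-100 prices below a reference price (105 is a placeholder;
# replace it with whatever observed value you want to compare against)
reference.price <- 105
mean(paths[n.days, ] < reference.price)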
Introductory Example
Let’s work through an example. The first thing that we need to do is load up all of the needed libraries in R that we will be using in these course notes. This isn’t needed for the SAS sections of the code.
install.packages("quantmod")
install.packages("graphics")
install.packages("ks")
install.packages("gtools")
library(quantmod)
library(graphics)
library(ks)
library(gtools)
Suppose you want to invest $1,000 (call this \(P_0\)) in the US stock market for one year. You invest in a mutual fund that tries to produce the same return as the S&P500 Index. Let’s keep this example basic and only work with annual return. The price after one year can be calculated as follows:
\[ P_1 = P_0 \times (1+r_{0,1}) \] The term \(r_{0,1}\) refers to the annual return from time point 0 to time point 1. Let’s assume annual returns follow a Normal distribution with a historical mean of 8.79% and a standard deviation of 14.75%.
Let’s see how we would simulate this problem in each of our softwares!
R
Simulation is rather easy to do in R because R is a vector language. This makes it so you don't have to use as many loops as in other languages, which is easily seen in this example. We start by setting a seed with the set.seed function so we can replicate our results. We then just use the rnorm function to randomly generate the possible values of the one year annual return. The n = option tells R the number of random draws you would like from the distribution. The mean = and sd = options define the mean and standard deviation of the normal distribution respectively. This creates a vector of observations. With R's ability to do vector math, we can just add 1 to every element in the vector and multiply that by our initial investment value defined as the P0 object below.
set.seed(112358)
r <- rnorm(n=10000, mean=0.0879, sd=0.1475)
P0 <- 1000
P1 <- P0*(1+r)
Then our simulation is complete! Next, we just need to view the distribution of our simulation. Below we use the summary function to calculate some summary statistics around our distribution of possible values of our portfolio after one year. The best way to truly look at the results of a simulation, though, is through plots. We then plot the histogram of P1, the values of our portfolio after one year.
summary(P1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
514.4 986.3 1086.7 1086.6 1185.8 1585.2
hist(P1, breaks=50, main='One Year Value Distribution', xlab='Final Value')
abline(v = 1000, col="red", lwd=2)
mtext("Initial Inv.", at=1000, col="red")
Not surprisingly, we can see that a majority of the values are above our initial investment of 1000. In fact, on average, we would expect to make an 8.79% return as that is the mean of our normal distribution. However, there is a spread to our returns as well, so not all returns are above our initial investment. We can see there is a portion of scenarios that have us fall below our initial investment value.
How did we select the normal distribution for our returns? Is that a good guess of a distribution? We will find that out in a later section.
SAS
Simulations in SAS depend highly on the DATA step. This is easily seen in this example. We start by creating a new data set called SPIndex using the DATA step. We then use a DO loop, where we loop across the values of i from 1 to 10,000. We set up our initial investment value as P0 and define r as the return. We use the RAND function to draw a random value for our returns. This random value comes from the options specified in the RAND function. Here we specify that we want a 'Normal' distribution with a mean of 0.0879 and standard deviation of 0.1475. We then calculate the portfolio value after 1 year and call it P1. The OUTPUT statement saves this calculation as a row and moves to the next iteration of the loop. The END statement defines the end of the loop.
data SPIndex;
do i = 1 to 10000;
P0 = 1000;
r = RAND('Normal', 0.0879, 0.1475);
P1 = P0*(1 + r);
output;
end;
run;
Then our simulation is complete! Next, we just need to view the distribution of our simulation. Below we use the PROC MEANS procedure to calculate some summary statistics around our distribution of possible values of our portfolio after one year. We use the DATA = option to define the dataset of interest and the VAR statement to define the variable we are interested in summarizing, here P1. The best way to truly look at the results of a simulation, though, is through plots. We then plot the histogram of P1, the values of our portfolio after one year, using PROC SGPLOT and the HISTOGRAM statement for P1.
proc means data=SPIndex;
var P1;
run;
proc sgplot data=SPIndex;
title 'One Year Value Distribution';
histogram P1;
refline 1000 / axis=x label='Initial Inv.' lineattrs=(color=red thickness=2);
keylegend / location=inside position=topright;
run;
Not surprisingly, we can see that a majority of the values are above our initial investment of 1000. In fact, on average, we would expect to make an 8.79% return as that is the mean of our normal distribution. However, there is a spread to our returns as well, so not all returns are above our initial investment. We can see there is a portion of scenarios that have us fall below our initial investment value.
How did we select the normal distribution for our returns? Is that a good guess of a distribution? We will find that out in a later section.
Distribution Selection
When designing your simulations, the biggest choice comes from the decision of the distribution on the inputs that vary. There are 3 common methods for selecting these distributions:
- Common Probability Distribution
- Historical (Empirical) Distribution
- Hypothesized Future Distribution
Common Probability Distribution
Typically, we assume a common probability distribution for inputs that vary in a simulation. We did this in our above example where we just assumed returns followed a normal distribution. When selecting distributions, we need to first decide if our variables are discrete or continuous in nature. A discrete variable takes on outcomes that you can list out (even if you cannot list all of them). Continuous variables, on the other hand, take on any possible value in an interval (even if the interval is unbounded).
Discrete Distribution Examples
Some commonly used discrete distributions for simulations are the following:
- Uniform distribution
- Poisson distribution
In the discrete uniform, every discrete value of the distribution has an equal probability of occurring.
The Poisson distribution is best used for when you are counting events. It is bounded below by 0 and takes on all non-negative integer values with differing probabilities of occurrences based on the location of the distribution.
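As a quick illustration of drawing from these two discrete distributions in R (the values and the Poisson rate below are arbitrary choices, not part of the example above):

set.seed(112358)

# Discrete uniform: each listed value is equally likely
rolls <- sample(1:6, size=10000, replace=TRUE)
table(rolls)/10000

# Poisson: counts of events, here with an average rate (lambda) of 3 per period
counts <- rpois(n=10000, lambda=3)
hist(counts, main='Simulated Poisson Counts', xlab='Events per Period')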
Continuous Distribution Examples
Some commonly used continuous distributions for simulations are the following:
- Uniform distribution
- Triangle distribution
- Normal distribution
- Student’s t-distribution
- Lognormal distribution
- Exponential distribution
- Beta distribution
Much like its discrete counterpart, in the continuous uniform distribution every value of the distribution has an equal probability of occurring.
The triangle distribution is a great distribution for business scenarios where individuals might only be able to tell you three things - the worst case scenario, the best case scenario, and the most likely (in the middle) scenario. With these three points we can define the triangle distribution.
The Gaussian (normal) distribution is one of the most popular distributions out there. Bell-shaped and symmetric with nice mathematical properties, the normal distribution is used for inputs in many simulations.
The Student's t-distribution is bell-shaped and symmetric like the normal distribution, but it has thicker tails. This allows for more variability than the normal distribution provides.
The lognormal distribution is very popular in the financial simulation world as it has nice asymmetric properties which stock returns sometimes follow.
The exponential distribution is great for simulating risky (sometimes called tail) events. It is one of many distributions like it used in Extreme Value Theory.
The Beta distribution is a family of distributions that takes many shapes. One of the most popular shapes for simulations is the “bathtub” shape, which has higher probabilities of edge cases occurring as compared to the middle cases.
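To get a feel for the shapes described above, the sketch below draws from several of these distributions in R; the parameter values are arbitrary and chosen only to show the shapes (the triangle distribution is skipped since it is not in base R):

set.seed(112358)

u  <- runif(n=10000, min=0, max=1)            # continuous uniform
z  <- rnorm(n=10000, mean=0, sd=1)            # normal
tt <- rt(n=10000, df=5)                       # Student's t (heavier tails)
ln <- rlnorm(n=10000, meanlog=0, sdlog=0.5)   # lognormal (right skewed)
ex <- rexp(n=10000, rate=1)                   # exponential
be <- rbeta(n=10000, shape1=0.5, shape2=0.5)  # beta ("bathtub" shape)

par(mfrow=c(2,3))
hist(u, main='Uniform'); hist(z, main='Normal'); hist(tt, main="Student's t")
hist(ln, main='Lognormal'); hist(ex, main='Exponential'); hist(be, main='Beta(0.5, 0.5)')
par(mfrow=c(1,1))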
Historical (Empirical) Distributions
If you are unsure of the distribution of the data you are trying to simulate, you can estimate it using the process of kernel density estimation. Kernel density estimation is a non-parametric method of estimating distributions of data through smoothing out of data values.
The kernel density estimator is as follows:
\[ \hat{f}(x)=\frac{1}{nh} \sum_{i=1}^n \kappa (\frac{x-x_i}{h}) \] In the above equation \(\kappa\) is the kernel function. The kernel function builds a distribution centered around each of the points in your dataset. Imagine a small normal distribution built around each point; adding these normal distributions together forms one larger distribution. The default kernel function in SAS and R is the normal distribution, but there are other options like the quadratic, triangle, and Epanechnikov. In the above equation \(h\) is the bandwidth. The bandwidth controls the variability around each point: the larger the bandwidth, the more variation is built around each point in the distribution, as seen in the chart below:
The above plot shows kernel density estimates of data sampled from a standard normal distribution. The actual distribution is highlighted with the dashed line above. Three estimates with different bandwidths are shown. The smaller the bandwidth, the smaller the variability around each point, but also the more tightly the curve fits our data. The larger the bandwidth, the more variability around each point, which smooths out the curve much more. We want to find the balance. A lot of research has been done, and is still ongoing, on how best to select the bandwidth. We will leave it up to the computer to optimize this decision for us.
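A plot like the one described above can be sketched in R with the density function; the sample size, seed, and bandwidth values below are arbitrary choices for illustration:

set.seed(112358)
x <- rnorm(100)   # data sampled from a standard normal distribution

grid <- seq(-4, 4, length.out=200)
plot(grid, dnorm(grid), type='l', lty=2, lwd=2,
     main='KDE with Different Bandwidths', xlab='x', ylab='Density')
lines(density(x, bw=0.1), col='red')          # small bandwidth: wiggly, tightly fit
lines(density(x, bw=0.4), col='blue')         # moderate bandwidth
lines(density(x, bw=1.5), col='darkgreen')    # large bandwidth: oversmoothed
legend('topright', legend=c('True N(0,1)', 'bw = 0.1', 'bw = 0.4', 'bw = 1.5'),
       lty=c(2,1,1,1), col=c('black','red','blue','darkgreen'))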
Once we have this kernel density estimate for our data, we can sample from this estimated distribution for our simulation. In our previous example we assumed a normal distribution for our returns of our mutual fund. Now, let’s build an empirical distribution through kernel density estimation and sample from that instead of the normal distribution.
Let’s see how to do this in each of our softwares!
R
Kernel density estimation can only happen when you have data to estimate a distribution on. First, we need to collect historical data on the S&P500 Index. We can do this through the getSymbols function in R. By putting in the ticker symbol as it appears on Yahoo Finance, we can pull data from Yahoo Finance through this function. The ticker for the S&P500 Index is ^GSPC. The getSymbols function then gets us historical, daily opening prices, closing prices, high prices, and low prices all the way back to 2007. We want to calculate the annual return on the closing prices for the S&P500 Index. We can use the periodReturn function in R on the GSPC.Close column of the GSPC object created from the getSymbols function. The period = option defines the period you want to calculate returns on; we want yearly returns. We then plot a histogram of the data.
= "^GSPC"
tickers
getSymbols(tickers)
[1] "^GSPC"
<- periodReturn(GSPC$GSPC.Close, period = "yearly")
gspc_r
hist(gspc_r, main='Historical S&P500', xlab='S&P500 Annual Returns')
Next we need to use the density function in R to estimate the bandwidth for the kernel density estimation.
Density.GSPC <- density(gspc_r)
Density.GSPC
Call:
density.default(x = gspc_r)
Data: gspc_r (15 obs.); Bandwidth 'bw' = 0.06879
x y
Min. :-0.59123 Min. :0.004351
1st Qu.:-0.31783 1st Qu.:0.122386
Median :-0.04442 Median :0.379342
Mean :-0.04442 Mean :0.913284
3rd Qu.: 0.22898 3rd Qu.:1.796440
Max. : 0.50239 Max. :2.626864
As you can see, the bandwidth is calculated and provided in the output above. We then use the rkde function in place of the rnorm function to randomly sample from the kernel density estimate of our distribution. The input to the rkde function is fhat = kde(gspc_r, h = Density.GSPC$bw). Notice we have the returns in the gspc_r object and the bandwidth from the density output in the h = option. From there, we can specify how many random observations we draw from the distribution. We then plot a histogram of this distribution to get an idea of the kernel density estimate distribution.
set.seed(112358)
Est.GSPC <- rkde(fhat=kde(gspc_r, h=Density.GSPC$bw), n=10000)

hist(Est.GSPC, breaks=50, main='KDE of Historical S&P500', xlab='S&P500 Annual Returns')
We can replace the normal distribution returns with these new KDE returns in our portfolio calculation to find the value after one year. This is done below with the new histogram of those returns.
r <- Est.GSPC
P0 <- 1000
P1 <- P0*(1+r)
hist(P1, breaks=50, main='One Year Value Distribution', xlab='Final Value')
abline(v = 1000, col="red", lwd=2)
mtext("Initial Inv.", at=1000, col="red")
This looks a little different than when we simulated with the normal distribution. The average is about the same, but there are two clusters of values in this histogram. We have a smaller spike around a loss of roughly 40%, which brings the value down to only about $600. This reflects the historical returns.
SAS
Kernel density estimation can only happen when you have data to estimate a distribution on. First, we need to collect historical data on the S&P500 Index. We can do this through the getSymbols function in R, which pulls data from Yahoo Finance using the ticker symbol as it appears there. We will use this data gathered from R (see the R section of code for how) here in SAS for the analysis.
Next we need to use PROC KDE in SAS to estimate the bandwidth for the kernel density estimation. We use the GSPC dataset we gathered from R. The UNIVAR statement on the returns variable, along with the UNISTATS option, shows us some summary metrics on this estimated density along with the bandwidth.
proc kde data=SimRisk.GSPC;
univar returns / unistats;
run;
Inputs | |
---|---|
Data Set | SIMRISK.GSPC |
Number of Observations Used | 15 |
Variable | returns |
Bandwidth Method | Sheather-Jones Plug In |

Controls (returns) | |
---|---|
Grid Points | 401 |
Lower Grid Limit | -0.661 |
Upper Grid Limit | 0.5726 |
Bandwidth Multiplier | 1 |

Univariate Statistics (returns) | |
---|---|
Mean | 0.084 |
Variance | 0.028 |
Standard Deviation | 0.17 |
Range | 0.68 |
Interquartile Range | 0.19 |
Bandwidth | 0.092 |
As you can see, the bandwidth is calculated and provided in the output above. We need to use PROC IML in SAS to randomly sample from this kernel density estimated distribution. IML stands for interactive matrix language and serves as a matrix language inside of SAS, similar in vein to R. IML code is much different from typical SAS code and won't be covered in too much detail here. Essentially, the code below creates a function SmoothBootstrap that samples an observation from the input dataset. It then randomly generates a sample of 1 from a normal distribution around this point a prespecified number of times (here 10,000 in the code). We then import the data set using the USE and READ statements to take the variable returns and store it as x. From there we call the SmoothBootstrap function we created to sample from the kernel density estimate, using the bandwidth from the output above as one of the inputs to this function. We add 1 to this vector of numbers and multiply by 1,000 to get the final portfolio value after one year. We then use CREATE to create a new dataset called Smooth and store the results with the APPEND statement. Once we are complete, we view the distribution with PROC SGPLOT as we did previously when we used the normal distribution.
proc iml;
start SmoothBootstrap(x, B, Bandwidth);
N = 1;
s = Sample(x, N // B);
eps = j(B, N);
call randgen(eps, "Normal", 0, Bandwidth);
return( s + eps );
finish;
use SimRisk.GSPC;
read all var {returns} into x;
close SimRisk.GSPC;
call randseed(12345);
y = (1 + SmoothBootstrap(x, 10000, 0.092))*1000;
Est_y = y`;
create Smooth var {"Est_y"};
append;
close Smooth;
quit;
proc sgplot data=Smooth;
title 'One Year Value Distribution';
histogram Est_y;
refline 1000 / axis=x label='Initial Inv.' lineattrs=(color=red thickness=2);
keylegend / location=inside position=topright;
run;
This looks a little different than when we simulated with the normal distribution. The average is about the same, but there are two clusters of values in this histogram. We have a smaller spike around a loss of roughly 40%, which brings the value down to only about $600. This reflects the historical returns.
Hypothesized Future Distribution
The final choice for a distribution is a hypothesized future distribution. Maybe you know of an upcoming change in your variable, so that past information will not reflect the future distribution. For example, the volatility in the market is forecasted to increase, so instead of a standard deviation of 14.75% you forecast it to be 18.25%. In these situations, you can select any distribution of choice.
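Continuing the mutual fund example, using a hypothesized future distribution only changes what we feed into the random number generator. A minimal sketch, assuming the forecasted 18.25% standard deviation mentioned above:

set.seed(112358)

# Same mean return as before, but with the forecasted (hypothesized) higher volatility
r.future <- rnorm(n=10000, mean=0.0879, sd=0.1825)
P1.future <- 1000*(1 + r.future)

summary(P1.future)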
Compounding and Correlations
Complications arise in simulation when you are simulating multiple inputs changing at the same time. Even when the distributions of these inputs are the same, the final result can still be hard to calculate mathematically. This shows the value of simulation, since we don't have to work out the mathematical equations for the combination of many input distributions.
When dealing with multiple input probability distributions there are some general facts we can use.
When a constant is added to a random variable (the variable with the distribution) then the distribution is the same, only shifted by the constant.
The sum of many random variables with the same distribution rarely follows that same shape of distribution.
- The exception to this is INDEPENDENT normal distributions.
The product of many random variables with the same distribution rarely follows that same shape of distribution.
- The exception to this is INDEPENDENT lognormal distributions.
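A quick simulation check of these facts (with arbitrary parameter values) is sketched below: the sum of independent normals keeps a normal shape, the sum of independent exponentials does not stay exponential, and the product of independent lognormals stays lognormal.

set.seed(112358)

# Sum of 5 independent normals: still bell-shaped and symmetric
sum.norm <- replicate(10000, sum(rnorm(5, mean=2, sd=1)))
hist(sum.norm, breaks=50, main='Sum of 5 Independent Normals', xlab='Sum')

# Sum of 5 independent exponentials: no longer exponential in shape
sum.exp <- replicate(10000, sum(rexp(5, rate=1)))
hist(sum.exp, breaks=50, main='Sum of 5 Independent Exponentials', xlab='Sum')

# Product of 5 independent lognormals: still right skewed and lognormal
prod.lnorm <- replicate(10000, prod(rlnorm(5, meanlog=0, sdlog=0.3)))
hist(prod.lnorm, breaks=50, main='Product of 5 Independent Lognormals', xlab='Product')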
Introductory Example Extended
Let's adapt our previous example of investing in mutual funds. Now you want to invest $1,000 in the US stock market for 30 years. You invest in a mutual fund that tries to produce the same return as the S&P500 Index. Again, using the simple scenario where compounding isn't continuous, we have the following value of our portfolio over 30 years:
\[ P_{30} = P_0 \times (1+r_{0,1}) \times (1+r_{1,2}) \times \cdots \times (1+r_{29,30}) \]
The returns are structured as annual returns between each year which are now multiplied together. Again, we will assume annual returns follow a Normal distribution with historical mean of 8.79% and standard deviation of 14.75% for every year.
Let’s see how to do this in our softwares!
R
Simulation is rather easy to do in R because R is a vector language. This makes it so you don't have to use as many loops as in other languages. However, in this example we use loops to give the reader a sense of how loops work in R (this example can be done without loops). We start by setting a seed with the set.seed function so we can replicate our results. We then just use the rnorm function to randomly generate the possible values of the one year annual return. The n = option tells R the number of random draws you would like from the distribution. The mean = and sd = options define the mean and standard deviation of the normal distribution respectively. However, we do this within a couple of loops. The outer loop is based on the number of simulations we want to run (here 10,000). We then calculate the return for 1 year and loop through 29 more years of the same calculation, multiplying in the additional return each iteration through the loop.
set.seed(112358)
P30 <- rep(0,10000)
for(i in 1:10000){
  P0 <- 1000
  r <- rnorm(n=1, mean=0.0879, sd=0.1475)

  Pt <- P0*(1 + r)

  for(j in 1:29){
    r <- rnorm(n=1, mean=0.0879, sd=0.1475)
    Pt <- Pt*(1+r)
  }
  P30[i] <- Pt
}
Then our simulation is complete! Next, we just need to view the distribution of our simulation. Below we use the summary function to calculate some summary statistics around our distribution of possible values of our portfolio after 30 years. The best way to truly look at the results of a simulation, though, is through plots. We then plot the histogram of P30, the values of our portfolio after 30 years.
summary(P30)
Min. 1st Qu. Median Mean 3rd Qu. Max.
404.1 5748.6 9710.1 12681.8 16141.5 185117.3
hist(P30, breaks=50, main='30 Year Value Distribution', xlab='Final Value')
abline(v = 1000, col="red", lwd=2)
mtext("Initial Inv.", at=1000, col="red")
Not surprisingly, we can see that almost all of the values are above our initial investment of 1000. In fact, on average, we would expect to make an 8.79% return annually as that is the mean of our normal distribution. However, there is a spread to our returns, and our returns are no longer normally distributed. They are now right skewed because we multiplied the effects year over year.
SAS
Simulations in SAS depend highly on the DATA step. This is easily seen in this example. We start by creating a new data set called SPIndex30 using the DATA step. We then use a DO loop, where we loop across the values of i from 1 to 10,000. We set up our initial investment value as P0 and define r as the return. We use the RAND function to draw a random value for our returns. This random value comes from the options specified in the RAND function. Here we specify that we want a 'Normal' distribution with a mean of 0.0879 and standard deviation of 0.1475. We then calculate the portfolio value after 1 year and call it Pt. We then loop through 29 more years of the same calculation in an inner loop, multiplying in the additional return each iteration. The OUTPUT statement saves this calculation as a row and moves to the next iteration of the loop. The END statement defines the end of each loop.
data SPIndex30;
do i = 1 to 10000;
P0 = 1000;
r = RAND('Normal', 0.0879, 0.1475);
Pt = P0*(1 + r);
do j = 1 to 29;
r = RAND('Normal', 0.0879, 0.1475);
Pt = Pt*(1 + r);
end;
output;
end;
run;
Then our simulation is complete! Next, we just need to view the distribution of our simulation. Below we use the PROC MEANS procedure to calculate some summary statistics around our distribution of possible values of our portfolio after 30 years. We use the DATA = option to define the dataset of interest and the VAR statement to define the variable we are interested in summarizing, here Pt. The best way to truly look at the results of a simulation, though, is through plots. We then plot the histogram of Pt, the values of our portfolio after 30 years, using PROC SGPLOT and the HISTOGRAM statement for Pt.
proc means data=SPIndex30;
var Pt;
run;
proc sgplot data=SPIndex30;
title '30 Year Value Distribution';
histogram Pt;
refline 1000 / axis=x label='Initial Inv.' lineattrs=(color=red thickness=2);
keylegend / location=inside position=topright;
run;
Not surprisingly, we can see that almost all of the values are above our initial investment of 1000. In fact, on average, we would expect to make an 8.79% return annually as that is the mean of our normal distribution. However, there is a spread to our returns, and our returns are no longer normally distributed. They are now right skewed because we multiplied the effects year over year.
Adding Correlation
Not all inputs may be independent of each other. Having correlations between your input variables adds even more complication to the simulation and final distribution. In these scenarios, we may need to simulate random variables that have correlation with each other.
One way to “add” correlation to data is to multiply the correlation into the data through matrix multiplication (linear algebra). Think about the approach we would take to change the variance of a single variable. If we had a variable X with a normal distribution (mean of 3 and variance of 2), then how could we get the variance to be 4 instead? We can do this in a few simple steps:
- Standardize X to have a mean of 0 and variance of 1, \(Z = \frac{X-3}{\sqrt{2}}\)
- Multiply the standardized variable by the standard deviation you desire, \(\sqrt{4}\times Z\)
- Recenter back to the original mean, \(\sqrt{4} \times Z + 3\)

Now we have a new variable \(Y = \sqrt{4} \times Z + 3\) that is normally distributed with a mean of 3 and variance of 4, like we wanted. We want to do the same thing on a multivariate scale to add correlation into data. We have a desired correlation matrix (that is, a standardized version of the covariance matrix). We need to multiply this into the variables of interest to create our new variables. We do this in similar steps:
- Standardize each column of our data (X)
- Multiply X by the “square root” of the correlation matrix you want (this is called the Cholesky decomposition)
- Convert X back to its original pre-standardized mean and variance.
The Cholesky decomposition is essentially the matrix equivalent of the square root of a number. A square root of a number is a number that, when multiplied by itself, gives the original number. The Cholesky decomposition is the same idea in matrix form: it is the matrix \(L\) that, when multiplied by its transpose (\(L \times L^T\)), gives you the original matrix.
To add in the correlation to a set of data, we multiply this Cholesky decomposition of the desired covariance matrix by our data. If we had two variables, essentially this multiplication will leave the first column of data alone and “bend” the second column to look more like the first. This method works best for normally distributed variables, but still OK if the variables are symmetric and unimodal.
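As a small sketch of this idea (using the same -0.2 correlation that appears in the example that follows, with an arbitrary sample size and seed), the code below verifies that the Cholesky factor "squares" back to the correlation matrix and that multiplying it into standardized data produces roughly the desired correlation:

set.seed(112358)

R <- matrix(c(1, -0.2, -0.2, 1), nrow=2)   # desired correlation matrix
L <- t(chol(R))                            # lower-triangular Cholesky factor
L %*% t(L)                                 # recovers R, like squaring a square root

x1 <- rnorm(1000, mean=0.0879, sd=0.1475)
x2 <- rnorm(1000, mean=0.04, sd=0.07)

Z <- rbind(scale(x1)[,1], scale(x2)[,1])   # standardized data, one variable per row
correlated <- t(L %*% Z)                   # "bend" the second variable toward the first

cor(correlated)                            # off-diagonal is now close to -0.2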
Let’s extend our 30 year investment example from above. Instead of just investing in stocks, you are splitting your money equally across stocks and bonds for 30 years. The individual calculations will be similar to above, just with a different set of returns for bonds as compared to stocks.
\[ P_{30,S} = P_{0,S} \times (1+r_{0,1,S}) \times (1+r_{1,2,S}) \times \cdots \times (1+r_{29,30,S}) \] \[ P_{30,B} = P_{0,B} \times (1+r_{0,1,B}) \times (1+r_{1,2,B}) \times \cdots \times (1+r_{29,30,B}) \] \[ P_{30} = P_{30,S} + P_{30,B} \] Historically, Treasury bonds are perceived as safer investments so when the stock market does poorly more people invest in bonds. This leads to a negative correlation between the two. Let’s assume that our returns for stocks follow a normal distribution with mean of 8.79% and standard deviation of 14.75%, our bond returns follow a normal distribution with mean of 4% and standard deviation of 7%, and there is a correlation of -0.2 between the returns in any given year.
Let’s see how to solve this in our softwares!
R
Before running the simulation we need to set some things up. We first create a vector, as long as the number of simulations we want to run, full of 0's. We will replace these 0's with the portfolio values as we go through the simulation. Next we create the correlation matrix we desire in the object R. In the next line, we take the Cholesky decomposition of R with the chol function and then transpose it with the t function. From there we define some simple parameters around what percentage of the investment we put into each of stocks and bonds, as well as the initial investment value itself.

We then create two functions. The first, called standardize, takes a vector of data and creates a standardized version of that data. The second function, called destandardize, reverses this operation. By providing a vector of standardized data as well as the original version of the data, this function will transform the data back to its original mean and variance.
Value.r.bal <- rep(0,10000)
R <- matrix(data=cbind(1,-0.2, -0.2, 1), nrow=2)
U <- t(chol(R))
Perc.B <- 0.5
Perc.S <- 0.5
Initial <- 1000

standardize <- function(x){
  x.std = (x - mean(x))/sd(x)
  return(x.std)
}

destandardize <- function(x.std, x){
  x.old = (x.std * sd(x)) + mean(x)
  return(x.old)
}
Next, we have the simulation itself. In this simulation, we loop through the same calculation many times. First, we simulate 30 returns each for stocks and bonds from their respective normal distributions. Next, we combine these vectors into a matrix after standardizing each of them. We then multiply the Cholesky decomposition by this standardized data matrix. From there, we use the destandardize function to put our data back to its original mean and variance. Now we have 30 years of correlated returns. From there we just loop through these returns to calculate the portfolio value over 30 years. Lastly, we sum up the values from each of the two investments.
set.seed(112358)
for(j in 1:10000){
  S.r <- rnorm(n=30, mean=0.0879, sd=0.1475)
  B.r <- rnorm(n=30, mean=0.04, sd=0.07)
  Both.r <- cbind(standardize(S.r), standardize(B.r))
  SB.r <- U %*% t(Both.r)
  SB.r <- t(SB.r)

  final.SB.r <- cbind(destandardize(SB.r[,1], S.r), destandardize(SB.r[,2], B.r))

  Pt.B <- Initial*Perc.B
  Pt.S <- Initial*Perc.S
  for(i in 1:30){
    Pt.B <- Pt.B*(1 + final.SB.r[i,2])
    Pt.S <- Pt.S*(1 + final.SB.r[i,1])
  }
  Value.r.bal[j] <- Pt.B + Pt.S
}
hist(Value.r.bal, breaks=50, main='30 Year Value Distribution', xlab='Final Value')
abline(v = 1000, col="red", lwd=2)
mtext("Initial Inv.", at=1000, col="red")
By looking at the histogram above, we see what we would expect. We have a right skewed distribution with almost all possibilities resulting in a higher portfolio value after 30 years. This portfolio doesn’t have as high of a tail as only stocks due to the “safer” investment of the bonds.
I invite you to play around with the balance of money allocated to stocks and bonds as the trade-off might surprise you with finding a good portfolio in terms of return and risk! For more on risk management, please see the code notes pack on risk management.
SAS
Pulling this off in SAS is much more difficult. First, we have to set up the problem. In the first data step, we create the correlation matrix, but we have to create it as many times as we have simulations. We create this in the Corr_Matrix dataset. Next, we create a Naive dataset that has values of 0 for both stocks and bonds as many times as we have simulations. Think of these as a starting point that we will eventually delete.
Next we use PROC MODEL to sample the random variables and build in the correlation structure. For each piece we initially set the value of either the stocks or bonds at their respective means. Then we simulate random normal distributions around these means. The Normal option here only has one input, the variance, since it will simulate around the mean we previously defined. Next we use the SOLVE statement to add in the correlation structure for each iteration in our simulation. From there we have a DATA = option to store the generated data from the simulations by each iteration i. Lastly, we use a DATA step on our created data to delete that initial placeholder row for each simulation.
data Corr_Matrix;
do i=1 to 10000;
_type_ = "corr";
_name_ = "S_r";
S_r = 1.0;
B_r = -0.2;
output;
_name_ = "B_r";
S_r = -0.2;
B_r = 1.0;
output;
end;
run;
data Naive;
do i=1 to 10000;
S_r=0;
B_r=0;
output;
end;
run;
proc model noprint;
/*Distribution of S*/
S_r = 0.0879;/*Mean of S*/
errormodel S_r ~Normal(0.02175625); /*Variance is defined here as 0.1475^2*/
/*Distribution of B*/
B_r = 0.04;/*Mean of B*/
errormodel B_r ~Normal(0.0049); /*Variance is defined here as 0.07^2*/
/*Generate the data and store them in the dataset work.Correlated_X */
solve S_r B_r/ random=30 sdata=Corr_Matrix
data=Naive out=Correlated_X(keep=S_r B_r i _rep_);
by i;
run;
quit;
data Correlated_X;
set Correlated_X;
by i;
if first.i then delete;
rename i=simulation;
rename _rep_=obs;
run;
Now that we have our simulated, correlated returns, we need to calculate the portfolio value over the 30 years. Again, this isn't as straightforward in SAS since we typically manipulate columns, not rows. First, we will use PROC TRANSPOSE to create columns for each of the 30 years of returns and let the rows be each simulation. To do this we transpose the variables listed in the VAR statement and use the BY statement to keep the simulations as rows. In the next DATA step we just perform the portfolio calculation of multiplying the returns across columns. Finally, we transpose the data back with PROC TRANSPOSE and then plot to see our final distribution with PROC SGPLOT.
proc transpose data=Correlated_X out=SB30;
var S_r B_r;
by simulation;
run;
data SB30A;
set SB30;
P30 = 500*(1+COL1)*(1+COL2)*(1+COL3)*(1+COL4)*(1+COL5)*
(1+COL6)*(1+COL7)*(1+COL8)*(1+COL9)*(1+COL10)*
(1+COL11)*(1+COL12)*(1+COL13)*(1+COL14)*(1+COL15)*
(1+COL16)*(1+COL17)*(1+COL18)*(1+COL19)*(1+COL20)*
(1+COL21)*(1+COL22)*(1+COL23)*(1+COL24)*(1+COL25)*
(1+COL26)*(1+COL27)*(1+COL28)*(1+COL29)*(1+COL30);
run;
proc transpose data=SB30A out=SB30A_Final;
var P30;
id _name_;
by simulation;
run;
data SB30A_Final;
set SB30A_Final;
Value = S_r + B_r;
run;
proc sgplot data=SB30A_Final;
title '30 Year Value Distribution';
histogram Value;
refline 1000 / axis=x label='Initial Inv.' lineattrs=(color=red thickness=2);
keylegend / location=inside position=topright;
run;
By looking at the histogram above, we see what we would expect. We have a right skewed distribution with almost all possibilities resulting in a higher portfolio value after 30 years. This portfolio doesn’t have as high of a tail as only stocks due to the “safer” investment of the bonds.
I invite you to play around with the balance of money allocated to stocks and bonds as the trade-off might surprise you with finding a good portfolio in terms of return and risk! For more on risk management, please see the code notes pack on risk management.
How Many and How Long
The possible number of outcomes from a simulation is basically infinite. Essentially, simulations sample these values. Therefore, the accuracy of the estimates of these simulations depends on the number of simulated values. This leads to the question of how many simulations should we run.
Confidence interval theory in statistics helps reveal the relationship between the accuracy and number of simulations. For example, if we wanted to know the mean value from our simulation, we know the margin of error around the mean.
\[ MoE(\bar{x}) = t \times \frac{s}{\sqrt{n}} \] In the above equation, \(n\) is the number of simulations that we are running, while \(s\) is the standard deviation of the simulated values. Notice that the margin of error shrinks with the square root of \(n\). Therefore, to double the accuracy, we need to approximately quadruple the number of simulations.
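A short sketch of this relationship, reusing the one-year portfolio example with arbitrary simulation sizes, shows the margin of error around the mean shrinking as the number of simulations grows:

set.seed(112358)

moe <- function(n){
  P1 <- 1000*(1 + rnorm(n, mean=0.0879, sd=0.1475))
  qt(0.975, df=n-1) * sd(P1)/sqrt(n)   # t-based margin of error for the mean
}

sapply(c(1000, 4000, 16000), moe)   # quadrupling n roughly halves the margin of error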
Theory Assessment
In mathematics and statistics, there are popular theorems involving the distributions of quantities calculated from data. The Central Limit Theorem is one of them. With simulation, we don't need complicated mathematics to approximate these distributions. This is especially helpful when finding a closed form solution is very difficult, if not impossible. A closed form solution to a mathematical/statistical distribution problem means that you can mathematically calculate the distribution through equations. Real world data can be very complicated and may change based on different inputs which each have their own unique distribution. Simulation can reveal an approximation of these output distributions.
The next two sections work through some examples of this.
Central Limit Theorem
Assume you do not know the Central Limit Theorem, but want to study and understand the sampling distribution of sample means. You take samples of size 10, 50, and 100 from the following three population distributions and calculate the sample means repeatedly:
- Normal distribution
- Uniform distribution
- Exponential distribution
Let’s use each of our softwares to simulate the sampling distributions of sample means across each of the above distributions and sample sizes!
R
This simulation can easily be done in R without the use of loops. First, we set the sample size and simulation size of the simulation. Next, we just randomly sample from each of the above distributions using the rnorm, runif, and rexp functions. In each of these functions we sample as many observations as we would need across all simulations, which is the product of the sample size and the simulation size. We then use the matrix function to take the draws from the random sampling functions and allocate the observations across rows and columns of a matrix. Specifically, we will use the rows as simulations and the columns as samples. Lastly, we just use the apply function on each matrix to apply the mean function to each row (or simulation). From there we can plot the distributions of the sample means.
sample.size <- 50
simulation.size <- 10000

X1 <- matrix(data=rnorm(n=(sample.size*simulation.size), mean=2, sd=5), nrow=simulation.size, ncol=sample.size, byrow=TRUE)
X2 <- matrix(data=runif(n=(sample.size*simulation.size), min=5, max=105), nrow=simulation.size, ncol=sample.size, byrow=TRUE)
X3 <- matrix(data=(rexp(n=(sample.size*simulation.size)) + 3), nrow=simulation.size, ncol=sample.size, byrow=TRUE)

Mean.X1 <- apply(X1,1,mean)
Mean.X2 <- apply(X2,1,mean)
Mean.X3 <- apply(X3,1,mean)
hist(Mean.X1, breaks=50, main='Sample Distribution of Means for Normal Distribution', xlab='Sample Means')
hist(Mean.X2, breaks=50, main='Sample Distribution of Means for Uniform Distribution', xlab='Sample Means')
hist(Mean.X3, breaks=50, main='Sample Distribution of Means for Exponential Distribution', xlab='Sample Means')
The example above shows sampling distributions for the mean when the sample size is 50. Easily changing that piece of the code will allow the user to see these sampling distributions at different sample sizes.
As you can see above, the sampling distribution for the sample means is approximately normally distributed, regardless of the population distribution, as long as the sample size is big enough.
SAS
This simulation can easily be done in SAS with a couple of loops in a DATA step. First, we set the sample size and simulation size of the simulation with MACRO variables using the % symbol. Next, we open two DO loops to loop over the simulation and sample size. Then we just randomly sample from each of the above distributions using the RAND function with the Normal, Uniform, and Exponential options. We then use the OUTPUT statement to make sure that the results from each iteration in the loop are stored as a separate row in the dataset. Lastly, we just use the PROC MEANS procedure to calculate the means across each variable (using the VAR statement) for each simulation (using the BY statement). We use the OUTPUT statement and OUT = option to store the resulting means in a new dataset as different columns - one column for each population distribution, with one row per simulation. From there we can plot the distributions of the sample means using three separate SGPLOT procedure calls.
%let Sample_Size = 50;
%let Simulation_Size = 10000;
data CLT;
do sim = 1 to &Simulation_Size;
do obs = 1 to &Sample_Size;
call streaminit(12345);
X1 = RAND('Normal', 2, 5);
X2 = 5 + 100*RAND('Uniform');
X3 = 3 + RAND('Exponential');
output;
end;
end;
run;
proc means data=CLT noprint mean;
var X1 X2 X3;
by sim;
output out=Means mean(X1 X2 X3)=Mean_X1 Mean_X2 Mean_X3;
run;
proc sgplot data=Means;
title 'Sampling Distribution of Means';
title2 'Normal Distribution';
histogram Mean_X1;
keylegend / location=inside position=topright;
run;
proc sgplot data=Means;
title 'Sampling Distribution of Means';
title2 'Uniform Distribution';
histogram Mean_X2;
keylegend / location=inside position=topright;
run;
proc sgplot data=Means;
title 'Sampling Distribution of Means';
title2 'Exponential Distribution';
histogram Mean_X3;
keylegend / location=inside position=topright;
run;
The example above shows sampling distributions for the mean when the sample size is 50. Easily changing that piece of the code will allow the user to see these sampling distributions at different sample sizes.
As you can see above, the sampling distribution for the sample means is approximately normally distributed, regardless of the population distribution, as long as the sample size is big enough.
Omitted Variable Bias
What if you leave an important variable out of a linear regression that should have been in the model? From statistical theory it would change the bias of the coefficients in the model as well as the variability of the coefficients, depending on if the variable left out was correlated with another variable in the model. Simulation can help us show this as well as determine how bad these problems could get under different circumstances.
Let’s simulate the following regression model:
\[ Y = -13 + 1.21X_1 + 3.45X_2 + \varepsilon \] Assume that the errors, \(\varepsilon\) in the model above are normally distributed with a mean of 0 and standard deviation of 1.5. Also assume that the predictors \(X_1\) and \(X_2\) are normally distributed with mean of 0 and standard deviation of 1. Build 10,000 linear regressions (each of sample size 50) and record the coefficients from the regression models for each of these simulations:
- Model with both \(X_1\) and \(X_2\) in the model
- Model with only \(X_1\) in the model, where the omitted \(X_2\) has a correlation of 0.6 with \(X_1\)
- Model with only \(X_1\) in the model, where the omitted \(X_2\) has a correlation of 0 with \(X_1\)
Let’s see the simulations in each of our softwares!
R
This simulation is a little more drawn out. First, we set the parameters of interest - sample size, simulation size, coefficients for the regression model, and the correlation between the X’s. Next, we calculate the Cholesky decomposition of our correlation matrix to help with the correlating of data. From there we simulate three variables, \(X_1\), \(X_2\), and \(X_{2u}\). The first \(X_2\) we will correlate with \(X_1\), but we will leave \(X_{2u}\) uncorrelated with \(X_1\). Lastly, we correlate \(X_1\) and \(X_2\) as described in the above sections on correlated variables.
sample.size <- 50
simulation.size <- 10000
beta0 <- -13
beta1 <- 1.21
beta2 <- 3.45
correlation_coef <- 0.6

R <- matrix(data=cbind(1,correlation_coef, correlation_coef, 1), nrow=2)
U <- t(chol(R))

X1 <- rnorm(sample.size*simulation.size, mean = 0, sd = 1)
X2 <- rnorm(sample.size*simulation.size, mean = 0, sd = 1)
X2u <- rnorm(sample.size*simulation.size, mean = 0, sd = 1)

Both <- cbind(standardize(X1), standardize(X2))
X1X2 <- U %*% t(Both)
X1X2 <- t(X1X2)
Now we need to create our models. In other words, we need to simulate our target variables under the conditions described above. We have two models to simulate. The first is Y.c, where we have \(Y\) as a function of \(X_1\), \(X_2\), and our error, \(\varepsilon\), simulated from a normal distribution using rnorm. The second is Y.u, where we have \(Y\) as a function of \(X_1\), \(X_{2u}\), and our error, \(\varepsilon\), simulated from a normal distribution using rnorm.

Next, we just create columns for each simulation of our variables in datasets to loop across.
# Time to create the two target variables - one with correlated X's, one without. #
all_data <- data.frame(X1X2)

all_data$Y.c <- beta0 + beta1*all_data$X1 + beta2*all_data$X2 + rnorm(sample.size*simulation.size, mean = 0, sd = 1.5)
all_data$Y.u <- beta0 + beta1*all_data$X1 + beta2*X2u + rnorm(sample.size*simulation.size, mean = 0, sd = 1.5)

# Getting all the data in separate columns to loop across. #
X1.c <- data.frame(matrix(all_data$X1, ncol = simulation.size))
X2.c <- data.frame(matrix(all_data$X2, ncol = simulation.size))
X2.uc <- data.frame(matrix(X2u, ncol = simulation.size))
Y.c <- data.frame(matrix(all_data$Y.c, ncol = simulation.size))
Y.uc <- data.frame(matrix(all_data$Y.u, ncol = simulation.size))
Now we get to run the regressions. We use lapply to apply the lm function (for building linear regressions) across all three types of models. The all_results_c object has the linear regressions leaving out a correlated \(X_2\). The all_results_uc object has the linear regressions leaving out an uncorrelated \(X_{2u}\). The all_results_ols object has the correct linear regressions leaving out no variables. From there, we use the sapply function to apply the coef function across each of our linear regressions to get the coefficients from the models. Lastly, we just plot the densities of the coefficients from each of the three scenarios using the density function.
# Running regressions across every column. #
all_results_c <- lapply(1:simulation.size, function(x) lm(Y.c[,x] ~ X1.c[,x]))
all_results_uc <- lapply(1:simulation.size, function(x) lm(Y.uc[,x] ~ X1.c[,x]))
all_results_ols <- lapply(1:simulation.size, function(x) lm(Y.c[,x] ~ X1.c[,x] + X2.c[,x]))

# Gathering and plotting the distribution of the coefficients. #
betas_c <- sapply(all_results_c, coef)
betas_uc <- sapply(all_results_uc, coef)
betas_ols <- sapply(all_results_ols, coef)
par(mfrow=c(3,1))
plot(density(betas_ols[2,]), xlim = c(-1,5), main = "Slope Histogram - OLS with Correct Model")
abline(v = 1.21, col="red", lwd=2)
plot(density(betas_c[2,]), xlim = c(-1,5), main = "Slope Histogram - Correlated X2 Not in Model")
abline(v = 1.21, col="red", lwd=2)
plot(density(betas_uc[2,]), xlim = c(-1,5), main = "Slope Histogram - Uncorrelated X2 Not in Model")
abline(v = 1.21, col="red", lwd=2)
par(mfrow=c(1,1))
From the plots above we can see the impact of leaving out important variables that should be in our model. If our model leaves out a predictor variable that is correlated with another predictor variable, the coefficient of the predictor variable in the model is biased. The vertical lines show the true value of the coefficient. When the omitted variable is correlated, our distribution of the coefficient of \(X_1\) is not centered around its true value. Even if the omitted variable isn't correlated, we can see that the standard error (the spread) of the distribution of the coefficient for the variable \(X_1\) is still wider than it would be under the circumstance where we specified the model correctly.
Another thing we can check is how many times we incorrectly fail to reject the null hypothesis that the coefficient on \(X_1\) is zero. We know \(X_1\) is an important variable since we created the model. However, let's look at the p-values for the coefficient on \(X_1\) for each of the simulations. We just use the sapply function to apply the summary function and extract the coefficients element from each linear regression. Then we just sum up how many times our p-value was above 0.05. This should NOT be the case too often since we know \(X_1\) should be in our model.
pval_c <- sapply(all_results_c, function(x) summary(x)$coefficients[2,4])
pval_uc <- sapply(all_results_uc, function(x) summary(x)$coefficients[2,4])
pval_ols <- sapply(all_results_ols, function(x) summary(x)$coefficients[2,4])

(sum(pval_ols > 0.05)/10000)*100

[1] 1.36

(sum(pval_c > 0.05)/10000)*100

[1] 0

(sum(pval_uc > 0.05)/10000)*100

[1] 40.9
As you can see, in the correctly specified model, we incorrectly say \(X_1\) is not statistically significant only 1.36% of the time. However, when we left out our \(X_2\) variable and it wasn't correlated, we make this mistake 40.9% of the time.
Omitting important variables is a problem. If those variables are correlated with other inputs that are in the model, then our inputs in the model have biased estimations. However, even if they are not correlated and therefore unbiased, we make many more mistakes about our model input’s significance in the model.
SAS
This simulation is a little more drawn out. First, we set the parameters of interest - sample size, simulation size, coefficients for the regression model, and the correlation between the X's - using MACRO variables. Next, we build in the correlation structure between \(X_1\) and \(X_2\) using the same approach as described in the correlation section above, using DATA steps and PROC MODEL.
%let Sample_Size = 50;
%let Simulation_Size = 10000;
%let Beta0 = -13;
%let Beta1 = 1.21;
%let Beta2 = 3.45;
%let Correlation_Coef = 0.6;
/* Creating correlation matrices */
data Corr_Matrix;
do i = 1 to &Simulation_Size;
_type_ = "corr";
_name_ = "X_1";
X_1 = 1.0;
X_2 = &Correlation_Coef;
output;
_name_ = "X_2";
X_1 = &Correlation_Coef;
X_2 = 1.0;
output;
end;
run;
/* Creating initial values as place holders */
data Naive;
do i = 1 to &Simulation_Size;
X_1 = 0;
X_2 = 0;
output;
end;
run;
/* Generate the correlated X_1 and X_2 */
proc model noprint;
/* Distribution of X_1 */
X_1 = 0;
errormodel X_1 ~ Normal(1);
/*Distribution of X_2*/
X_2 = 0;
errormodel X_2 ~ Normal(1);
/* Generate the data and store them in the dataset work.Correlated_X */
solve X_1 X_2 / random = &Sample_Size sdata = Corr_Matrix
data = Naive out = Correlated_X(keep = X_1 X_2 i _rep_);
by i;
run;
quit;
data Correlated_X;
set Correlated_X;
by i;
if first.i then delete;
rename i = sim;
rename _rep_ = obs;
run;
Now we need to create our models. First, we will simulate our errors from the model, \(\varepsilon\), and our uncorrelated \(X_{2u}\) variable. Then we can merge all of our simulated data together in a DATA step with the MERGE statement. With this new combined dataset we have columns of all of our simulated variables across many simulated rows. All we then need to do is combine the columns into our target variables \(Y\) and \(Yu\) for the model with correlated and uncorrelated inputs respectively. The code below builds these datasets.
/* Generate Linear Regression Model Data */
data simulation;
do sim = 1 to &Simulation_Size;
do obs = 1 to &Sample_Size;
call streaminit(112358);
err = sqrt(2.25)*rand('normal');
X_2u = sqrt(1)*rand('normal');
output;
end;
end;
run;
data all_data;
merge simulation Correlated_X;
by sim obs;
Y = &Beta0 + &Beta1*X_1 + &Beta2*X_2 + err;
Yu = &Beta0 + &Beta1*X_1 + &Beta2*X_2u + err;
run;
Now we get to run the regressions. We use PROC REG to build all of our models by the variable sim using the BY statement so we get results for each simulation's rows. We store the output from all these regressions into a dataset called all_results using the OUTEST = option. We are complete with the simulation. Next we just clean up the results with some DATA step manipulation for labeling, as well as a PROC SORT to make the visuals displayed in PROC SGPANEL nicer to see. The options in the plotting procedure aren't covered in detail here as that is not the focus.
/* Run all regressions and store the output */
proc reg data = all_data outest = all_results tableout noprint;
OLSC: model Y = X_1 X_2;
Corr: model Y = X_1;
UnCorr: model Yu = X_1;
by sim;
run;
/* Give meaningful values to the _model_ variable */
data all_results;
set all_results;
label _model_="Model";
if upcase(_model_) eq "CORR" then _model_="Omit Correl. X";
else if upcase(_model_) eq "UNCORR" then _model_="Omit Uncorrel. X";
else if upcase(_model_) eq "OLSC" then _model_="Correct Model";
run;
/* Normal plots and univariate statistics for each variable/model */
proc sort data=all_results;
by _model_;
run;
/* Create a panel of plots, by model, to contrast normal vs. empirical distr. AKA Make things look pretty... */
/* Also add a reference line that highlights the true parameter value */
proc sgpanel data = all_results(where=(_type_ eq 'PARMS'));
panelby _model_ / layout=ROWLATTICE uniscale=column columns=1 onepanel;
density X_1 / type=kernel;
refline &Beta1 / axis=x label="&Beta1" labelattrs=(weight=bold);
run;
From the plots above we can see the impact of leaving out important variables that should be in our model. If our model leaves out a predictor variable that is correlated with another predictor variable, the coefficient of the predictor variable in the model is biased. The vertical lines show the true value of the coefficient. When the omitted variable is correlated, our distribution of the coefficient of \(X_1\) is not centered around its true value. Even if the omitted variable isn't correlated, we can see that the standard error (the spread) of the distribution of the coefficient for the variable \(X_1\) is still wider than it would be under the circumstance where we specified the model correctly.
Omitting important variables is a problem. If those variables are correlated with other inputs that are in the model, then our inputs in the model have biased estimates. However, even if they are not correlated and therefore unbiased, we still have wider standard errors (spread) in our estimates, which would lead to more incorrect conclusions in hypothesis tests for the included input variable.
Model Evaluation - Target Shuffling
Simulations can also be used to evaluate models. One of the most popular examples is target shuffling. Target shuffling has been around for a long time, but has recently been brought back to popularity by Dr. John Elder from Elder Research. Target shuffling is when you randomly reorder the target variable's values among the sample, while keeping the predictor variable values fixed. An example of this is shown below:
In the above picture, we are using two variables to predict the target variable if you are going to buy a product. Those two variables are age and gender. However, we have also shuffled the values (same number of categories, just different order) of the target variable to create columns \(Y_1\), \(Y_2\), etc.
The goal of target shuffling is to build a model for each of the target variables - both the one real one and all the shuffled ones. We will then record the same model evaluation metric from each shuffle to see how well our model performs compared to the random data. The shuffling should remove the pattern from the data, but some pattern may still exist in some of the shuffles due to randomness. This will help us evaluate how well our model did as compared to random as well as tell us the probability we arrived at our results just by chance.
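As a minimal sketch of the shuffling step itself (the variable and its values here are hypothetical), a shuffle simply permutes the target's values while the predictor rows stay fixed; in R the sample function does this directly:

set.seed(112358)

buy <- c("Yes", "No", "No", "Yes", "No")   # hypothetical target values
sample(buy)                                # one shuffled copy: same values, new order

# Several shuffled copies, one per column, leaving the predictors untouched
shuffles <- replicate(5, sample(buy))
shuffles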
The following sections outline two examples.
Fake Data
We are going to randomly generate 8 predictor variables that each follow a standard normal distribution (mean of 0 and standard deviation of 1), along with an error term that follows a normal distribution with a mean of 0 and standard deviation of 8. However, only two of these variables will be related to our target variable:
\[ Y = 5 + 2X_2 - 3X_8 + \varepsilon \] We are going to perform a target shuffle on our target variable \(Y\) and look at a couple of things. First, the adjusted \(R^2\) value from our model compared to the target shuffled (random data) models. Second, the number of significant variables in the target shuffled (random data) models. Remember, there should be no relationship when we shuffle our target so our adjusted \(R^2\) should be close to zero in the target shuffled models and there should not really be any significant variables in the target shuffled models.
Let’s see this example in each of our softwares!
R
Similar to the omitted variable bias example above, we will need to generate predictor variables first and then calculate our target variable from the two that should be in the model. Our first step is to generate a matrix of 100 observations (rows) of 8 variables (columns) all with normal distributions with mean 0 and standard deviation of 1. We will save this as a data frame Fake. Next, we have the Err object that is our model error term, \(\varepsilon\), that has a normal distribution with mean of 0 and standard deviation of 8. Now we create our target variable \(Y\) as defined above and combine all these columns into one data frame Fake.
set.seed(1234)
Fake <- data.frame(matrix(rnorm(n=(100*8)), nrow=100, ncol=8))
Err <- rnorm(n=100, mean=0, sd=8)
Y <- 5 + 2*Fake$X2 - 3*Fake$X8 + Err
Fake <- cbind(Fake, Err, Y)
The next step is to shuffle the target variable. We will create another matrix with 100 observations (rows) and 1000 columns - one column for each shuffle. With a for loop we can randomly sample from a uniform distribution using the runif function. We will then order our Y column by these uniform random numbers. This will essentially randomly order the target variable. We do this repeatedly and store these in each column of our 1000 shuffle columns. We rename these columns (Y.1, Y.2, …) and combine them all into one new data frame.
sim <- 1000
Y.Shuffle <- matrix(0, nrow=100, ncol=sim)
for(j in 1:sim){
  Uniform <- runif(100)
  Y.Shuffle[,j] <- Y[order(Uniform)]
}
Y.Shuffle <- data.frame(Y.Shuffle)
colnames(Y.Shuffle) <- paste('Y.', seq(1:sim), sep="")
Fake <- data.frame(Fake, Y.Shuffle)
Now we just run our 1001 linear regressions. The for loop will run a linear regression using the lm function on all the shuffled columns. The summary function’s adj.r.squared element is then stored for each target shuffled model. Lastly, we also run the true regression model and get its adjusted \(R^2\) using the same procedure. This gives us 1001 adjusted \(R^2\) values. Let’s visualize these adjusted \(R^2\) values on a histogram.
R.sq.A <- rep(0,sim)
for(i in 1:sim){
  R.sq.A[i] <- summary(lm(Fake[,10+i] ~ Fake$X1 + Fake$X2 + Fake$X3 + Fake$X4
                          + Fake$X5 + Fake$X6 + Fake$X7 + Fake$X8))$adj.r.squared
}
True.Rsq.A <- summary(lm(Fake$Y ~ Fake$X1 + Fake$X2 + Fake$X3 + Fake$X4
                         + Fake$X5 + Fake$X6 + Fake$X7 + Fake$X8))$adj.r.squared
hist(c(R.sq.A, True.Rsq.A), breaks=50, col = "blue", main='Distribution of Adjusted R-Squared Values', xlab='Adjusted R-Squared')
abline(v = True.Rsq.A, col="red", lwd=2)
mtext("True Model", at=True.Rsq.A, col="red")
Our model outperforms all of the other models. However, despite not expecting any good results from the shuffled data, we see some adjusted \(R^2\) values that almost reach 0.2. Again, this is due to random chance. This allows us to test our model to see if we just got a lucky data set, or if we really have some interesting findings.
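If we want a single number out of this comparison, the proportion of shuffled models whose adjusted \(R^2\) meets or beats the true model’s acts as an empirical p-value. Using the R.sq.A and True.Rsq.A objects created above:
# Empirical p-value: proportion of shuffled models that match or beat the
# true model's adjusted R-squared (0 in this run, since no shuffle did)
mean(R.sq.A >= True.Rsq.A)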
Let’s see how many variables were significant in each of our shuffled models. Remember, the data has been completely shuffled so there shouldn’t be any significant variables. To do this, use a similar for loop as above, but instead of looking at the adj.r.squared element of the summary function, we look at the \(4^{th}\) column of the coefficients element - the p-values for our variables. We then count the number of times the p-value was below a significance level of 0.05.
P.Values <- NULL
for(i in 1:sim){
  P.V <- summary(lm(Fake[,10+i] ~ Fake$X1 + Fake$X2 + Fake$X3 + Fake$X4
                    + Fake$X5 + Fake$X6 + Fake$X7 + Fake$X8))$coefficients[,4]
  P.Values <- rbind(P.Values, P.V)
}
Sig <- P.Values < 0.05
colSums(Sig)
(Intercept) Fake$X1 Fake$X2 Fake$X3 Fake$X4 Fake$X5
1000 51 46 53 53 43
Fake$X6 Fake$X7 Fake$X8
53 46 50
table(rowSums(Sig)-1)
0 1 2 3 4
675 264 53 7 1
hist(rowSums(Sig)-1, breaks=25, col = "blue", main='Count of Significant Variables Per Model', xlab='Number of Sig. Variables')
Most of the time we see that no variables were significant, but out of the 1000 shuffles each variable was significant about 50 times. This isn’t surprising, since a significance level of 0.05 means a hypothesis test will produce a false positive about 5% of the time. However, sometimes these errors happened together. Seven of the 1000 shuffled models had three significant variables, and one even had four. This is all due to random chance.
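As a rough sanity check, if the eight tests in a shuffled model behaved like independent Bernoulli(0.05) trials (an assumption - the tests share one fitted model, so they are not exactly independent), the number of significant variables per shuffle would follow a Binomial(8, 0.05) distribution. The expected counts out of 1000 shuffles (roughly 663, 279, 52, 5, and 0.4 for 0 through 4 significant variables) line up closely with the observed 675, 264, 53, 7, and 1:
# Expected number of shuffles (out of 1000) with 0, 1, ..., 4 significant
# variables if each of the 8 tests were an independent Bernoulli(0.05) trial
round(dbinom(0:4, size = 8, prob = 0.05) * 1000, 1)
# Observed counts from the shuffles above
table(rowSums(Sig) - 1)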
SAS
Similar to the omitted variable bias example above, we will need to generate predictor variables first and then calculate our target variable from the two that should be in the model. Our first step is to generate a dataset of 100 observations (rows) of 8 variables (columns) all with normal distributions with mean 0 and standard deviation of 1. Next, we have the Err variable that is our model error term, \(\varepsilon\), that has a normal distribution with mean of 0 and standard deviation of 8. Now we create our target variable \(Y\) as defined above and combine all these columns into one dataset Fake using the DATA step and the RAND function with the 'Normal' option. Don’t forget the OUTPUT statement to store your DO loop results at each iteration.
data Fake;
call streaminit(12345);
do obs = 1 to 100;
X1 = RAND('Normal', 0, 1);
X2 = RAND('Normal', 0, 1);
X3 = RAND('Normal', 0, 1);
X4 = RAND('Normal', 0, 1);
X5 = RAND('Normal', 0, 1);
X6 = RAND('Normal', 0, 1);
X7 = RAND('Normal', 0, 1);
X8 = RAND('Normal', 0, 1);
Err = RAND('Normal', 0, 8);
Y = 5 + 2*X2 - 3*X8 + Err;
output;
end;
run;
The next step is to shuffle the target variable. We will do this using a MACRO we create called shuffle. This MACRO uses a DATA step to create a random sample from a uniform distribution using the RAND function with the 'Uniform' option. We then order our Y column by these uniform random numbers using PROC SORT. This essentially randomly orders the target variable. We use another DATA step to merge this new randomly shuffled dataset with our original one using the MERGE statement. We then use PROC DATASETS to delete the random shuffle dataset so it doesn’t clog up our memory. Next we create another MACRO called sim_shuffle that runs the shuffle MACRO a specified number of times. We will run ours 500 times.
%macro shuffle(sim);
data shuffle_&sim;
set Fake;
Uni = RAND('Uniform');
Y_&sim = Y;
keep Y_&sim Uni;
run;
proc sort data=shuffle_&sim;
by Uni;
run;
data Fake;
merge Fake shuffle_&sim;
drop Uni;
run;
proc datasets noprint;
delete shuffle_&sim;
run;
quit;
%mend shuffle;
%macro sim_shuffle(sim);
%do i = 1 %to &sim;
%shuffle(&i);
%end;
%mend sim_shuffle;
%sim_shuffle(500);
Now we just run our 501 linear regressions. The PROC REG procedure will run a linear regression on the original target \(Y\) and on all 500 shuffled columns. We will use the SELECTION=BACKWARD option to backward select our significant variables with a significance level of 0.05. We store the adjusted \(R^2\) values in the OUTEST= dataset with the help of the ADJRSQ option. Let’s visualize these adjusted \(R^2\) values on a histogram with PROC SGPLOT.
proc reg data=Fake noprint outest=Rsq_A;
model Y Y_1-Y_500 = X1-X8 / selection=backward slstay=0.05 adjrsq;
run;
quit;
proc sgplot data=Rsq_A;
title 'Distribution of Adjusted R-Squared';
histogram _ADJRSQ_;
refline 0.178 / axis=x label='True Model' lineattrs=(color=red thickness=2);
keylegend / location=inside position=topright;
run;
Our model outperforms all of the other models. However, despite not expecting any good results from the shuffled data, we see some adjusted \(R^2\) values that almost reach 0.2. Again, this is due to random chance. This allows us to test our model to see if we just got a lucky data set, or if we really have some interesting findings.
Let’s see how many variables were significant in each of our shuffled models. Remember, the data has been completely shuffled so there shouldn’t be any significant variables. Because we used backward selection with a significance level of 0.05, any variable left in a shuffled model survived purely by chance. The _IN_ variable in the OUTEST= dataset records how many predictors remained in each model, so we can plot its distribution with PROC SGPLOT.
proc sgplot data=Rsq_A;
title 'Distribution of Number of Significant Variables';
histogram _IN_;
run;
Most of the shuffled models kept no significant variables, but some still retained one or more after backward selection. This isn’t surprising, since a significance level of 0.05 means each variable will look significant by chance about 5% of the time. However, sometimes these errors happen together and a single shuffled model retains several variables. This is all due to random chance.
Student Grade Analogy
Target shuffling is essentially a simulation version of permutation analysis. This can be easily seen with an example using student grades. Imagine there were 4 students in a class. One studied 1 hour for an exam, another 2 hours, another 3 hours, and the last one 4 hours. Their grades on the exam were 75, 87, 85, and 95 respectively. If you were to plot these grades on a scatter plot and run a linear regression you would have the following:
This model has an \(R^2 = 0.83\). What if the professor randomly shuffled these grades among the four students? How many different ways can four students get the grades of 75, 87, 85, and 95? There are 24 possible ways which are all listed below:
Of these 24 orderings, 3 other combinations produce a linear regression with an \(R^2\) value that is greater than or equal to our actual data’s. Two of these are plotted below (the third is the reverse of our actual ordering):
Therefore, including our actual data, there are 4 combinations of grades that produce a regression with an \(R^2\) that is greater than or equal to our actual data - a chance of \(4/24 = 1/6 = 16.67\%\). This problem didn’t need target shuffling because it was easy to enumerate all possible orderings of the target variable (\(4! = 24\)). However, what if we had 40 students instead of 4? The number of permutations of the target variable increases tremendously (\(40! \approx 8.16 \times 10^{47}\)). This is the value of target shuffling: it is essentially a simulated sampling of a permutation analysis.
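As a sketch of how this count could be reproduced in R, the permutations function from the gtools package enumerates all 24 orderings; the hour and grade values are the ones from the example above.
library(gtools)
hours  <- c(1, 2, 3, 4)
grades <- c(75, 87, 85, 95)
# R-squared from the actual ordering (about 0.83, as stated above)
actual.rsq <- summary(lm(grades ~ hours))$r.squared
# All 4! = 24 orderings of the grades, one ordering per row
perm.grades <- permutations(n = 4, r = 4, v = grades)
# Refit the regression for every ordering and record its R-squared
perm.rsq <- apply(perm.grades, 1, function(g) summary(lm(g ~ hours))$r.squared)
# Count the orderings (including the actual one) that match or beat the actual fit;
# the small tolerance guards against floating-point ties
sum(perm.rsq >= actual.rsq - 1e-10)
sum(perm.rsq >= actual.rsq - 1e-10) / factorial(4)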