These concepts bleed into ideas of machine learning, which is largely focused on high dimensional variable selection and weighting. In this chapter we cover some of the basics and, most importantly, the consequences of over- and under-fitting a model. In our Coursera Data Science Specialization, we have an entire class on prediction and machine learning. So, in this class, our focus will be on modeling. That is, our primary concern is winding up with an interpretable model, with interpretable coefficients.
This is a very different process than if we only care about prediction or machine learning.
Prediction has a different set of criteria, needs for interpretability and standards for generalizability. In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study. Like nearly all aspects of statistics, good modeling decisions are context dependent.
Consider a good model for prediction, versus one for studying mechanisms, versus one for trying to establish causal effects. There are, however, some principles to help you guide your way. Parsimony is a core concept in model selection. The idea of parsimony is to keep your models as simple as possible but no simpler. Simpler models are easier to interpret and are less finicky. Complex models often have issues with fitting and, especially, overfitting. Another principle that I find useful for looking at statistical models is to consider them as lenses through which to look at your data.
I attribute this quote to the great statistician Scott Zeger. Unwin and co-authors have formalized these ideas into something they call exploratory model analysis. I like this framing, as it turns our focus away from trying to find a single, best, true model and instead focuses on using models as ways to probe data.
This is useful, since all models are wrong in some fashion. Keep this in mind as we focus on variable inclusion and exclusion in this chapter. Another useful framing comes from Donald Rumsfeld, who gave this quote regarding weapons of mass destruction: "There are known knowns; these are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don't know. But there are also unknown unknowns. There are things we do not know we don't know."
This quote, widely derided in its intended context, is quite insightful in the unintended context of regression model selection. In regression terms: known knowns are regressors we know we should check for inclusion in the model; known unknowns are regressors we would like to include but did not collect; unknown unknowns are regressors we do not even know about. The known unknowns and unknown unknowns are especially challenging to deal with. A central method for dealing with unknown unknowns is randomization: if the treatment of interest is randomized, it is independent of all other variables, observed or not. Of course, since unknown unknowns are unobserved, you can never know whether or not the randomization was effective for them.
For known unknowns, those variables we wish we had collected but did not, there are several strategies. For example, a proxy variable might be of use. As an example, we had some brain volumetric measurements via MRIs and really wished we had done the processing to get intra-cranial volume (head size).
In our case, since we were studying humans, we used height, gender and other anthropometric measurements to get a good guess of intra-cranial volume. I want to reiterate this point: if the omitted variable is uncorrelated with the included variables, its omission has no impact on estimation of the included coefficients. It might still explain some residual variation, however, and thus it could have an impact on inference.
Formal theories of inference can be designed around the use of randomization. One might ask: why not simply include every variable available? The following facts prevent us from doing that: including any new variable increases the actual (though not necessarily the estimated) standard errors of the other regressors, and the model must tend toward a perfect fit as the number of non-redundant regressors approaches the sample size.
In this simulation, no regression relationship exists. This reminds us of a couple of things. First, irrelevant variables explain residual variation by chance. Second, when evaluating fit, we have to take into account the number of regressors included.
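The simulation just described can be sketched as follows. This is a minimal Python re-creation (the book's examples use R), with simulated data: the outcome and all regressors are independent noise, yet R-squared climbs as the number of regressors approaches the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# y and all regressors are independent noise: no true relationship exists.
y = rng.standard_normal(n)

def r_squared(p):
    """R^2 from regressing y on p random (irrelevant) regressors plus an intercept."""
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2 = {p: r_squared(p) for p in (5, 50, 95)}
print(r2)  # R^2 climbs toward 1 as p approaches n, despite pure noise
```

With 95 regressors on 100 data points, the fit is nearly perfect even though every regressor is irrelevant.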
We then repeatedly generate data from a model where y depends only on x1. We do this over and over again and look at the standard deviation of the x1 coefficient. Notice that the standard error for the x1 coefficient goes up as more regressors are included (left to right in our vector output). The estimated standard errors, the ones we have access to in a data analysis, may not go up as you include more regressors. Notice also that the variance inflation is much worse when the included variables are highly correlated with the one we are interested in.
In the first simulation the added regressors were independent of x1; in the second, they were correlated with it and the inflation was much worse. To quantify the inflation for a coefficient, all you need to do is take the ratio of the variances for that coefficient. Usefully, these ratios can be obtained from an observed data set: from a single observed dataset we can estimate the relative variance inflation caused by adding a regressor. Thus we can calculate exactly how much the inclusion of Examination increases the variance of the Agriculture effect. Variance inflation factors (VIFs) measure how much variance inflation a variable causes relative to the setting where it is orthogonal to the other regressors.
This is nice because it has a well-contained interpretation within a single model fit. So, in general, the VIFs are the most convenient entity to work with. Assuming that the model is linear with additive iid errors, we can mathematically describe the impact of omitting necessary variables or including unnecessary ones.
These two rules follow: if we underfit the model, omitting necessary covariates, the residual variance estimate is biased; if we correctly fit or overfit the model, including all necessary covariates and possibly some unnecessary ones, the variance estimate is unbiased, but its own variability is larger. These make sense. Omitted signal gets absorbed into the residuals, so we would expect a variance estimate that is systematically off (biased). We would also expect absence of bias when we throw the kitchen sink at the model and include everything, necessary and unnecessary. However, then our variance estimate is unstable (the variance of the variance estimate is larger). Ideally, you include only the necessary variables in a regression model. Thus we have to discuss variable selection a little bit.
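These rules can be checked by simulation. Below is a hedged Python sketch on simulated data: the true model depends on both x1 and x2, and we compare the average residual-variance estimate when x2 is omitted versus included (the true error variance is 1).

```python
import numpy as np

rng = np.random.default_rng(4)
n, nosim = 100, 2000

x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

def mean_sigma2_hat(X):
    """Average estimated residual variance over repeated simulated datasets."""
    Xd = np.column_stack([np.ones(n), X])
    ests = []
    for _ in range(nosim):
        y = x1 + x2 + rng.standard_normal(n)  # truth: y depends on x1 AND x2
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        r = y - Xd @ beta
        ests.append(r @ r / (n - Xd.shape[1]))
    return np.mean(ests)

under = mean_sigma2_hat(x1.reshape(-1, 1))            # omits x2: biased upward
correct = mean_sigma2_hat(np.column_stack([x1, x2]))  # unbiased, near 1
print(round(under, 2), round(correct, 2))
```

The underfit model's variance estimate sits well above the true value of 1 because x2's signal lands in the residuals; the correctly specified model's estimate is unbiased.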
Automated covariate selection is a difficult topic. It depends heavily on how rich a covariate space one wants to explore. The space of models explodes quickly as you add interactions and polynomial terms. In addition, principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces. It should also be noted that careful design can often eliminate the need for complex model searches at the analysis stage.
However, control over the design is often limited in data science. When interest lies in interpretation, my strategy is to think through the likely criticisms of a result and let those guide which covariates to include; I would use a different strategy for prediction. As an example, if I had a significant effect of lead exposure on brain size, I would think about the following criticism: were the high-exposure people simply smaller than the low-exposure people?
To address this, I would consider adding head size (intra-cranial volume) to the model. If the lead-exposed were more obese than the non-exposed, I would fit a model with body mass index (BMI) included. Most importantly, this process makes you think hard about the questions you're asking and the potential criticisms of your results. Heading those criticisms off at the pass early on is a good idea. Consider the following example of three nested models. That is, Model 3 contains all of the Model 2 variables, which contain all of the Model 1 variables.
The P-values are for a test of whether the newly added variables are all zero or not (i.e., whether the larger model is necessary). So this analysis would conclude that the terms added in Model 3 are necessary over Model 2, and the Model 2 terms are necessary over Model 1. Generalized linear models (GLMs) were a great advance in statistical modeling.
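Such nested-model tests can be sketched as follows. This is a Python stand-in for R's anova on nested linear fits, with simulated data (x1, x2, x3 are illustrative; in truth y depends on x1 and x2 but not x3), comparing each model to the next via the usual F-test on the drop in residual sum of squares.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100

# Simulated data: y truly depends on x1 and x2 but not x3.
x1, x2, x3 = rng.standard_normal((3, n))
y = 1 + 2 * x1 + x2 + rng.standard_normal(n)

def sse(X):
    """Residual sum of squares and residual df from regressing y on X (with intercept)."""
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r, n - Xd.shape[1]

def nested_f_pvalue(X_small, X_big):
    """P-value for the test that the extra coefficients in the bigger model are all zero."""
    sse1, df1 = sse(X_small)
    sse2, df2 = sse(X_big)
    f = ((sse1 - sse2) / (df1 - df2)) / (sse2 / df2)
    return stats.f.sf(f, df1 - df2, df2)

p12 = nested_f_pvalue(x1.reshape(-1, 1), np.column_stack([x1, x2]))
p23 = nested_f_pvalue(np.column_stack([x1, x2]), np.column_stack([x1, x2, x3]))
print(p12, p23)  # adding x2 yields a tiny p-value; adding the irrelevant x3 typically does not
```

Here the test comparing Model 1 to Model 2 correctly flags x2 as necessary, while the x3 comparison behaves like a null test.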
The McCullagh and Nelder book [1] is the famous standard treatise on the subject. Recall linear models. Linear models are the most useful applied statistical technique. However, they are not without their limitations. Additive models make little sense when the response is discrete or strictly positive, and transformations, such as taking a cube root of a count outcome, are often hard to interpret.
The generalized linear model is a family of models that includes linear models. By extending the family, it handles many of the issues with linear models, but at the expense of some complexity and the loss of some mathematical tidiness. A GLM involves three components: an exponential family model for the response, a systematic component via a linear predictor, and a link function that connects the mean of the response to the linear predictor. The three most famous cases of GLMs are: linear models, binomial and binary regression, and Poisson regression.
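To make the three components concrete, here is a hedged Python sketch of the binary-regression case (the book's examples use R's glm): a Bernoulli response from the exponential family, a linear predictor eta = b0 + b1*x, and the logit link connecting the mean to eta, fit by Newton's method. The true coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Component 1: Bernoulli (exponential family) response.
# Component 2: linear predictor eta = b0 + b1 * x.
# Component 3: logit link, mu = 1 / (1 + exp(-eta)).
x = rng.standard_normal(n)
true_beta = np.array([-0.5, 1.5])
eta = true_beta[0] + true_beta[1] * x
mu = 1 / (1 + np.exp(-eta))
y = rng.binomial(1, mu)

# Fit by Newton's method (equivalently, iteratively reweighted least squares).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                    # score of the Bernoulli log-likelihood
    hess = X.T @ (X * (p * (1 - p))[:, None])  # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(np.round(beta, 2))  # estimates should land near the assumed truth (-0.5, 1.5)
```

Swapping the family and link (e.g., Poisson with a log link) changes only the score and weight formulas; the overall structure of the fit is identical, which is the appeal of the GLM framework.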