One of a scientist’s greatest joys is the discovery of something true: an effect that appears over and over again when you perform an experiment under the same conditions. Maybe the reason this is so special is that, in real life, it happens rarely. That’s how things like the “reproducibility crisis” in science come about.
So why is it so difficult to detect an effect that directly relates to the phenomenon under study? There are many reasons. Some of the most important are limited sample sizes, which produce apparent effects due to sampling variability that will not reproduce, and methodological differences that have more serious consequences for findings than expected.
One of the most essential reasons, however, is that pretty much any measure you can take in (bio-)medicine depends on or can be influenced by something else. Not accounting for these effects can lead to false positive findings (effects that are not there), but also to false negative findings (effects that are there but that you failed to find). In this post, we will look at how to consider these dependencies.
More specifically, we will look at covariates and confounders. So, let’s start.
What are covariates and confounders?
In this post, we will call “covariate” any variable that you consider during your analysis in addition to your variable of interest. For example, when analyzing the dependency between a blood protein and a given diagnosis, age can be a covariate. In a model that tries to predict blood protein levels based on diagnosis (i.e. blood protein ~ diagnosis), age can be useful to consider if it accounts for variance that is unrelated to diagnosis.
Confounders are a special type of covariate. They are simultaneously associated with the variable you try to predict (e.g. blood protein levels) and your variable of interest (e.g. diagnosis). For example, if blood protein levels change with age and your diagnostic groups differ in age, you can easily find spurious associations between blood protein levels and diagnosis. These associations are unrelated to any true underlying effect and will not be reproducible in non-confounded datasets.
For an example, see the short video above. There is a difference between groups (e.g. diagnosis) in the variable X, but the variable X, as well as group membership, is strongly associated with age.
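To make the idea of a spurious, age-driven group difference concrete, here is a minimal simulation sketch. All numbers (group ages, the slope of 0.5, the noise levels) are invented for illustration and are not taken from any real dataset. The simulated protein level depends on age only, yet the two diagnostic groups appear to differ because they differ in age.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n_per_group=500, seed=0):
    """Simulate protein levels that depend on age only, never on diagnosis."""
    rng = random.Random(seed)
    rows = []
    # Hypothetical setup: controls are about 40, patients about 60 years old.
    for diagnosis, mean_age in ((0, 40.0), (1, 60.0)):
        for _ in range(n_per_group):
            age = rng.gauss(mean_age, 8.0)
            protein = 0.5 * age + rng.gauss(0.0, 3.0)  # no diagnosis term
            rows.append((diagnosis, age, protein))
    return rows


def group_difference(rows):
    """Mean protein level of patients minus mean protein level of controls."""
    patients = [protein for diagnosis, _, protein in rows if diagnosis == 1]
    controls = [protein for diagnosis, _, protein in rows if diagnosis == 0]
    return mean(patients) - mean(controls)


rows = simulate()
crude = group_difference(rows)

# Restrict the comparison to participants of similar age (48 to 52 years).
similar_age = [row for row in rows if 48.0 <= row[1] < 52.0]
banded = group_difference(similar_age)

print(f"difference, all participants:   {crude:.2f}")
print(f"difference, ages 48 to 52 only: {banded:.2f}")
```

In this sketch, the large group difference seen across all participants should shrink towards zero once only participants of similar age are compared, which is what you would expect from an association that is driven by the confounder alone.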
How to deal with covariates and confounders?
Covariates are typically included in your statistical model, to account for the variance they explain in the variable you try to predict, before testing associations with your variable of interest. You can, of course, include more than one covariate in this procedure. Each additional covariate uses up a degree of freedom, which costs some statistical power, but this effect is most often negligible compared to the benefit you obtain from accounting for the variance the covariate explains in the variable you try to predict.
Aspects relevant for multivariate modeling in general still apply, e.g. overly strong correlation between covariates should be avoided, and the assumptions of your model should still be fulfilled. An interesting further read in this context is the article by Westreich and Greenland on the “Table 2 fallacy”, meaning the dangers of interpreting statistical estimates for multiple covariates and exposure variables included in the same statistical model.
For confounders, the situation is a bit more complicated. By including the confounder in your statistical model, you will remove variance that is shared with the association you try to identify (e.g. blood protein levels versus diagnosis). This will reduce the size of the estimated effect and may lead to false negative findings or to incorrect interpretations of your results downstream.
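The benefit of including a covariate can be sketched with a small simulation. The example below is illustrative only: the data are simulated, the effect sizes are made up, and the helper functions `solve` and `ols` are hypothetical names written for this post (in practice you would normally rely on a statistics package). Here age is balanced between groups, so it is a covariate and not a confounder, and the model protein ~ diagnosis is compared with protein ~ diagnosis + age.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(design, y):
    """Least-squares coefficients and residual standard deviation."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in design]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return beta, (rss / (len(y) - k)) ** 0.5


rng = random.Random(1)
diagnosis = [i % 2 for i in range(400)]            # balanced groups
age = [rng.gauss(50.0, 10.0) for _ in diagnosis]   # unrelated to diagnosis
# Assumed true model: diagnosis effect 2.0, age effect 0.5 per year.
protein = [10.0 + 2.0 * d + 0.5 * a + rng.gauss(0.0, 2.0)
           for d, a in zip(diagnosis, age)]

beta_u, sd_u = ols([[1.0, d] for d in diagnosis], protein)
beta_a, sd_a = ols([[1.0, d, a] for d, a in zip(diagnosis, age)], protein)

print(f"protein ~ diagnosis:       effect {beta_u[1]:.2f}, residual SD {sd_u:.2f}")
print(f"protein ~ diagnosis + age: effect {beta_a[1]:.2f}, residual SD {sd_a:.2f}")
```

Under these assumptions, both models target the same diagnosis effect, but the adjusted model leaves much less unexplained variance, so the effect is estimated more precisely.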
This is why it is best to avoid confounding through an appropriate study design. It is also useful to first explore e.g. demographic information before you start analyzing your data, in order to identify the potential presence of confounding.
When you have enough data available, another option is matching. This means that you balance, for example, your diagnostic groups with regard to potential confounders. Matching breaks the association between the confounder and your variable of interest, so that the confounder is no longer a confounder.
This can be tricky when there is an imbalance in a larger number of variables. Propensity score matching is a technique frequently employed in this scenario, but depending on your dataset, it may not always work well. For a detailed treatment of matching, the article by Ho and colleagues is an interesting read. It shows that matching is more complex than just balancing groups for potential confounders and then forgetting about these (more on this in another post).
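As a rough illustration of matching on a single confounder, the sketch below pairs each patient with the closest-aged unused control, within a tolerance (a "caliper") of two years. This is plain greedy nearest-neighbour matching on age, not propensity score matching, and the function name `match_on_age`, the caliper and the simulated ages are all assumptions made for this example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def match_on_age(patient_ages, control_ages, caliper=2.0):
    """Greedy 1:1 nearest-neighbour matching on age, without replacement."""
    available = list(control_ages)
    pairs = []
    for patient_age in patient_ages:
        if not available:
            break
        nearest = min(available, key=lambda age: abs(age - patient_age))
        # Only accept the match if the age gap is within the caliper.
        if abs(nearest - patient_age) <= caliper:
            pairs.append((patient_age, nearest))
            available.remove(nearest)
    return pairs


rng = random.Random(2)
# Hypothetical cohort: patients are on average about 10 years older.
patient_ages = [rng.gauss(58.0, 8.0) for _ in range(100)]
control_ages = [rng.gauss(48.0, 10.0) for _ in range(400)]

gap_before = mean(patient_ages) - mean(control_ages)
pairs = match_on_age(patient_ages, control_ages)
gap_after = mean([p for p, _ in pairs]) - mean([c for _, c in pairs])

print(f"matched pairs: {len(pairs)} of {len(patient_ages)} patients")
print(f"mean age gap before matching: {gap_before:.2f} years")
print(f"mean age gap after matching:  {gap_after:.2f} years")
```

Note that patients without a sufficiently close control are dropped, so matching trades sample size for balance. This is one reason why it requires enough data.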
Which covariates are important?
Which set of covariates to consider is a difficult question. Typically, it is determined by expert opinion and often taken from the scientific literature on similar studies. It is advisable to perform a visual inspection of your data prior to starting any analyses. This will help in identifying unanticipated effects that may have a large impact on your data, such as batch effects, effects of data acquisition sequence, seasonality, etc.
If your data has more dimensions than can be easily visualized, it can be helpful to use tools such as Principal Component Analysis (PCA), which maps your data to the two or three dimensions that capture the most variance in your dataset. This should enable you to visually identify any substantial irregularities (e.g. a strong clustering structure in your PCA plot indicative of a measurement batch effect).
At this point it is worth mentioning the common practice of including a large set of covariates in an initial model and then allowing statistical software to perform a stepwise elimination of variables, in order to obtain a “best fitting” model. This procedure can overfit regression models, resulting in models that do not generalize well to independent datasets. For a non-technical introduction to this, have a look at the article by Babyak.
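The following sketch shows how a batch effect can surface in the first principal component. It is a simplified, standard-library-only illustration: the data and the batch offsets are simulated, the function name `first_principal_component` is made up for this post, and only the first component is computed (via power iteration on the covariance matrix). In practice you would normally use an established PCA routine and plot the first two or three components.

```python
import random


def mean(values):
    return sum(values) / len(values)


def first_principal_component(data, iterations=200):
    """Return the leading PCA direction and each sample's score along it."""
    n, k = len(data), len(data[0])
    means = [mean([row[j] for row in data]) for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    vec = [1.0 / k ** 0.5] * k
    for _ in range(iterations):  # power iteration towards the top eigenvector
        new = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sum(v * v for v in new) ** 0.5
        vec = [v / norm for v in new]
    scores = [sum(r[j] * vec[j] for j in range(k)) for r in centered]
    return vec, scores


rng = random.Random(3)
batch_shift = [4.0, -3.0, 2.0, 0.0, 1.0]   # hypothetical offsets for batch 1
batch = [i % 2 for i in range(60)]          # two measurement batches
data = [[rng.gauss(0.0, 1.0) + b * s for s in batch_shift] for b in batch]

component, scores = first_principal_component(data)
mean_b0 = mean([s for s, b in zip(scores, batch) if b == 0])
mean_b1 = mean([s for s, b in zip(scores, batch) if b == 1])
separation = abs(mean_b1 - mean_b0)

print(f"mean score, batch 0: {mean_b0:.2f}")
print(f"mean score, batch 1: {mean_b1:.2f}")
```

If the two batches sit at clearly different positions along the first component, as they should in this simulation, that is the kind of clustering structure that would be visible in a PCA plot.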
What to do if you don’t have data on covariates?
Especially when you are analyzing data that has been acquired as part of a different project, there is a chance that information on important covariates is missing. In this case, it may be possible to infer these covariates from the data, or to derive proxy measures that are useful as covariates. Examples are a measure of cigarette smoking habits predicted from epigenetic data, or genetic population structure predicted from genome-wide genetic association data.
When unmeasured covariates account for a large share of the overall variance in the data, Principal Component Analysis can be a good choice. For example, measurement batch effects typically account for much more variance in the dataset than can be reasonably expected from an illness-related effect.
Therefore, using the first PCA components (as in the example of genetic population stratification mentioned above) is typically a good idea and usually does not remove variation related to the variable you’re trying to predict. However, the number of components to use, as well as the degree to which these remove useful variance in the data, can be difficult to establish, which creates a degree of arbitrariness in the covariate selection.
In this scenario, you may like to look into techniques that identify “components” that capture substantial variance in the data, but that are unrelated to a pre-specified set of other variables. One such technique is called “Surrogate Variable Analysis” (SVA), which is implemented, for example, in R.
Using SVA, you can identify components that are e.g. not related to your outcome, variable of interest or covariates for which you do have data. You can then include the estimated components in your model together with the recorded covariates. Keep in mind that there is no guarantee that techniques that estimate components from your data will capture all unmeasured covariate effects.
How to take care of covariates in machine learning studies?
One of the challenges of machine learning is that many of the more complicated (and more powerful) techniques do not allow you to integrate covariates directly into the model building process. For supervised machine learning (i.e. the type where the variable that should be predicted is known to the model), data is thus commonly corrected for covariates prior to the model building phase.
This is typically performed using linear regression, by residualizing each feature against a set of pre-specified covariates. There are several aspects to consider:
The adjustment is performed using a linear model. Covariate effects may, however, be non-linear and some machine learning algorithms (e.g. random forests) are very powerful at picking up non-linear effects. This means your predictions may still be covariate-associated.
The residualization of your data using an additive regression model does not take care of potential, covariate-related interactions in your data, i.e. your machine learning model may pick these up afterwards.
Your covariate adjustment procedure should be considered a part of the model building process. It is generally easier (but less desirable) to adjust different datasets independently and then apply a machine learning model across them than it is to build your adjustment procedure on one dataset only and transfer this procedure to your test data.
It is advisable to assess whether the predictions from your machine learning model are associated with any covariates, even if you have included these in an adjustment procedure applied to your data prior to model building.
In collaborative projects, communicate with your colleagues on the harmonization of preprocessing strategies. Foldercase provides you with an infrastructure to share files, to have project-specific discussions, and to record the processing steps of your datasets. Keep in mind that some covariate-related procedures may be best implemented across collaboration partners, e.g. when relating some data-derived covariates (such as for genetic stratification) to a common reference database. Also, check out our blog post on how to organize your research and learn running data-driven collaborative projects on foldercase.
It should be noted that some machine learning studies also include covariates as features in the prediction. This may be meaningful if a generalizable contribution to the predictive capacity of other features can be expected. But using this approach to conclude covariates are not important if they do not rank among the most important predictors can be dangerous. It is still possible that the best ranking predictors are confounded, but preferentially selected by the machine learning model due to their outcome association.
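The residualization procedure discussed in this section can be sketched as follows. This is a minimal illustration with one simulated feature and one covariate (age). The function names `fit_adjustment` and `apply_adjustment` are hypothetical, and a real analysis would repeat this for every feature and usually for several covariates. The key point is that the adjustment is estimated on the training data only and then transferred unchanged to the test data.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_adjustment(feature, covariate):
    """Estimate intercept and slope of feature ~ covariate (training data)."""
    mx, my = mean(covariate), mean(feature)
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, feature))
    sxx = sum((x - mx) ** 2 for x in covariate)
    slope = sxy / sxx
    return my - slope * mx, slope


def apply_adjustment(feature, covariate, params):
    """Residualize a feature using previously estimated parameters."""
    intercept, slope = params
    return [y - (intercept + slope * x) for x, y in zip(covariate, feature)]


def correlation(a, b):
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


rng = random.Random(4)


def simulate(n):
    """Simulated feature with an assumed linear age effect of 0.3 per year."""
    age = [rng.gauss(50.0, 10.0) for _ in range(n)]
    feature = [0.3 * a + rng.gauss(0.0, 1.0) for a in age]
    return age, feature


train_age, train_feature = simulate(300)
test_age, test_feature = simulate(200)

params = fit_adjustment(train_feature, train_age)        # training data only
train_adj = apply_adjustment(train_feature, train_age, params)
test_adj = apply_adjustment(test_feature, test_age, params)

raw_corr = correlation(test_feature, test_age)
train_corr = correlation(train_adj, train_age)
test_corr = correlation(test_adj, test_age)

print(f"test feature vs age, before adjustment: {raw_corr:.2f}")
print(f"test feature vs age, after adjustment:  {test_corr:.2f}")
```

The same `correlation` check can be run between your model's predictions and each covariate to see whether covariate effects have survived the adjustment, for example because they are non-linear.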
No man is an island, and variables aren’t either. This post hopefully provided you with a non-technical introduction to covariates and confounding and a basis for further reading in this area, which is essential for deriving reproducible results from the analysis of data.