Machine learning: an introduction

Overview of machine learning-relevant topics

Nowadays, there is so much information on machine learning available online that it is hard to know where to start. And the field is advancing so quickly that it is difficult to catch up. Since you’ve come here, chances are that you’re looking for an introduction to machine learning that is easy to understand and that provides you with a solid basis for further study if needed. An introduction that doesn’t require you to read an entire book.

You’ve come to the right place.

To provide you with the most important information in as few words as possible, we get started straight away:

What is machine learning?

Machine learning aims to learn dependencies from data. There are two primary types of machine learning (there are others, but you will likely encounter these less frequently):

  1. Supervised machine learning. You build a model that uses input data to predict a variable (e.g. a diagnosis, age, symptom severity, treatment response etc.). Importantly, you provide that variable to the model. The model then finds the optimal combination of input data features to predict this variable.

  2. Unsupervised machine learning. You build a model providing only the input data. Instead of predicting anything, the model identifies structure within the input data, e.g. clusters of observations that are similar with regards to the input features.

Most of the machine learning applications you frequently hear about (facial recognition, autonomous driving, credit card fraud prevention, etc.) belong to the domain of supervised machine learning. Thus, this post will focus exclusively on these (more on unsupervised learning in another post).

How do you build a supervised machine learning model?

One of the simplest examples of a machine learning model, one that you will already know, is linear regression. Linear regression is a machine learning model that predicts some continuous variable Y, using some combination of input data X.

In the simplest case, there is only one variable X, and the association between Y and X is learnt by minimizing the squared difference between the Y predicted by the model and the actual Y (the so-called least squares loss). This results in a slope (beta) and an intercept, which are the parameters that specify the model.
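The least squares fit described above can be written out directly. Here is a minimal sketch with NumPy, using made-up toy data for X and Y:

```python
import numpy as np

# Toy data: one input feature X and a continuous outcome Y (made up)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares: choose the slope (beta) and intercept that minimize
# the sum of squared differences between predicted and actual Y
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
intercept = Y.mean() - beta * X.mean()

# The fitted model: predictions for the training observations
Y_pred = intercept + beta * X
```

The two numbers `beta` and `intercept` are the learned parameters; everything else a more complex model does is, in spirit, a variation on this fitting step.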

This concept is pretty much the same for most other supervised machine learning models. The main differences lie in the method used to find optimal combinations of input features, and in the “shape” (e.g. linear versus non-linear) of the resulting model.

Some examples of supervised machine learning models

  1. Linear discriminant analysis. Learns a combination of input data features that defines a linear boundary between two groups. The combination is found in the direction that minimizes the within-group variance and maximizes the between-group variance.

  2. Regression and classification trees. Partition the data iteratively in the form of a tree, e.g. if a value is <= a, go to the left branch, otherwise go to the right branch, then repeat with a different variable. The tree is grown until each branch ends in a “leaf” that only contains one observation / one class (more on building good versions of this in another post). This model can be highly non-linear.

  3. Support vector machines. Find the observations (support vectors) in your input data that lie on the margins of a maximum-margin “hyper-plane” separating two classes. The model is defined by these support vectors and can be non-linear.
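All three model types above share a nearly identical interface in practice. As a minimal illustration, assuming scikit-learn is available, here is a sketch that fits each of them on a synthetic two-class dataset (standing in for e.g. patients versus controls):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic two-class dataset: 200 observations, 5 input features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),             # linear boundary
    "Tree": DecisionTreeClassifier(random_state=0),  # iterative partitioning
    "SVM": SVC(kernel="rbf"),                        # non-linear via kernel
}
for name, model in models.items():
    model.fit(X, y)  # learn the model parameters from the input data
```

Note that checking accuracy on the same data used for fitting would be misleading, for reasons discussed below.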

How good is a machine learning model?

The main goal of supervised machine learning is to identify a model that predicts your outcome (e.g. diagnosis, etc.) well. Getting a good estimate of your model’s performance is essential to assess whether the model would be useful in practice, e.g. for future clinical application. There are many different measures to assess model performance. Here are some of the most commonly used for classification problems:

  1. Error rate. The fraction of observations classified incorrectly. Can be a problematic measure when class sizes are different (e.g. if your data contain many more patients than controls, a model that predicts every new observation as a patient will always have a low error rate).

  2. Sensitivity and specificity. For the patient versus control example: sensitivity is the fraction of patients classified correctly, specificity is the fraction of controls classified correctly. Takes care of imbalanced class distribution and is very widely used.

  3. ROC-AUC. The area under the Receiver Operating Characteristic curve. The main issue with sensitivity and specificity is that they relate to a single threshold for class assignment (e.g. model prediction > 0.5 results in an observation being classified as a patient). However, the threshold of 0.5 is potentially not an optimal choice. The ROC-AUC determines sensitivity and specificity for different thresholds (with high thresholds, sensitivity will go down and specificity up, and vice versa). You then plot 1-specificity against sensitivity for the different thresholds, and the ROC-AUC is the area under the resulting curve (a value of 1 is optimal, a value of 0.5 is chance).
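To make these measures concrete, here is a minimal sketch computing error rate, sensitivity, specificity and ROC-AUC on a handful of made-up labels and model scores (assuming scikit-learn is available; 1 = patient, 0 = control):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and model scores for six observations
y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
y_pred = (scores > 0.5).astype(int)  # single threshold at 0.5

error_rate = np.mean(y_pred != y_true)
sensitivity = np.mean(y_pred[y_true == 1] == 1)  # fraction of patients correct
specificity = np.mean(y_pred[y_true == 0] == 0)  # fraction of controls correct
auc = roc_auc_score(y_true, scores)  # threshold-free: uses the raw scores
```

Note how the AUC is computed from the raw scores rather than the thresholded predictions, which is exactly the point of sweeping over thresholds.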

One of the most important rules for building supervised machine learning models is that you cannot use your training data for testing the performance of your model. The model will always fit well to the training data, but this is not an honest measure of how well it will perform on unseen data. You need to have independent data for assessing performance, or use cross-validation.

Why do you need cross-validation for machine learning?

Cross-validation means that you randomly allocate your observations (e.g. patients and controls) to different “folds”, i.e. observation sets. Typically you would choose between 5 and 15 folds. Then, you train your model on the observations in all but one fold, and you test the model’s performance on the remaining fold. You repeat this process until you have used each fold once as “test fold”.

This way, you will obtain performance estimates of your model that are obtained from data not used for model training. These measures, which you can e.g. average across folds into one value, will give you a measure of how well your model will likely perform in unseen data.
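The procedure above can be sketched in a few lines, assuming scikit-learn is available (the dataset here is synthetic, standing in for real patients and controls):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Synthetic two-class dataset: 100 observations, 4 input features
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# 5 folds; each observation ends up exactly once in a test fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Train on all folds but one, test on the held-out fold
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average across folds for one overall performance estimate
mean_accuracy = np.mean(fold_scores)
```

The stratified variant keeps the class proportions (e.g. patients versus controls) roughly equal across folds, which is usually what you want for classification.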

More complex machine learning models in particular will be defined by one or more parameters (e.g. a cost parameter for incorrectly classified observations in support vector machines). These parameters need to be optimized for the model to achieve the best possible performance.

Optimizing these typically requires the assessment of model performance for different parameter settings using cross-validation, and again, these performance estimates cannot be derived from training data. In this scenario, you will likely need “nested cross-validation”, which we will cover in the follow-up tutorial to this one.
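As a rough preview, and assuming scikit-learn, nested cross-validation can be sketched by placing a parameter search (the inner loop) inside an outer cross-validation loop, so that the performance estimate never touches data used for tuning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic two-class dataset for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Inner loop: tune the SVM cost parameter C on the training folds only
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: estimate the performance of the whole tuning procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
```

Each of the five outer scores comes from a test fold that played no part in choosing C, which is what makes the estimate honest.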

How to build good machine learning models?

There are some principal rules of how to build good machine learning models. We will only highlight these here, since these have been or will be covered in detail in other posts on the foldercase blog.

  1. Use data that is not confounded. See our introduction on covariates and confounding for details. Building models on confounded data may dramatically over-estimate performance and will not generalize well to unseen data.

  2. Try to use training data that is as close to your potential application scenario as possible. In this way, the model will learn the specific properties that it needs for performing well in real-life application.

  3. Prefer the simpler model if you have a choice. Often, you can decide between more and less complex models with little difference in performance. You will likely find that the simpler one performs better on unseen data, and is easier to implement in practice. This is the so-called “Occam’s razor” principle.

  4. Use feature selection. If you have high-dimensional input data, use feature selection to reduce the model to a few important features. Less complex models typically generalize better, see point 3.

  5. Use multiple, independent validation datasets if you can. Cross-validation gives you relatively unbiased estimates of model performance, but models can still “overfit” to the specific properties of the training dataset. Truly independent test data are therefore a far better test of actual generalizability than cross-validation. Check out our blog post on how to organize your research and start using foldercase to manage multi-site machine learning studies.

  6. Check whether the predictions of your model are confounded. Often, linear adjustments are performed to remove the potential influence of confounders from the training data. Many machine learning models (in particular non-linear ones) can, however, pick up non-linear, residual confounding that will find its way into the predictions of the model. Assessing whether the predictions are associated with confounders will prevent you from drawing incorrect conclusions.
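As an illustration of point 4 above, and assuming scikit-learn, here is a sketch of feature selection done safely: placing the selection step inside a pipeline means it is re-fit within each training fold, so the test folds stay untouched and cannot leak into the selected features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# High-dimensional synthetic data: 100 observations, 50 features,
# of which only a few are actually informative
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# Select the 5 best features, then fit a simple linear classifier;
# both steps are refit on the training portion of every fold
pipe = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

Selecting features on the full dataset before cross-validating is a common mistake that inflates performance estimates; the pipeline construction avoids it.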

Why can machine learning be challenging?

Machine learning can be difficult for many reasons, here are some of the most common:

  1. The models are too difficult to learn. This happens when the relevance of individual features in your input data is very low, and many features need to be combined to obtain a good model. With the typical limitations in training data size, obtaining a good model may be challenging.

  2. Your training data may be confounded or not directly representative of your application scenario. For example, sometimes recruiting the right individuals in clinical studies is challenging and machine learning models are built on already existing data that do not completely reflect the application scenario.

  3. The outcome may be very noisy. Think about a measure derived from rating scales or diagnoses based on symptom evaluations. Machine learning models may reach a performance ceiling here, as the outcome itself may not be measured reliably enough to be predicted accurately from the input data.

  4. Data access is problematic. The ability of machine learning models to integrate large numbers of features can be problematic from a data privacy perspective, because sharing the corresponding data for training and testing may release sensitive information. See our tutorial on federated data analysis and machine learning for a possible solution.

How to get started with machine learning?

If you want to get started with machine learning, practice is essential. Starting with R- or Python-based tutorials is recommended, as these languages are widely used for machine learning, provide already implemented libraries for a vast number of machine learning algorithms, and have an extensive online community if you need assistance.

Get first-hand experience building models, see how bias can occur when test data are not independent, and look at ways to assess and remove confounding in your data. This way, you will slowly build a solid technical basis that will allow you to evaluate the potential and risks of applying machine learning to a given problem, and that will prepare you for more advanced approaches.