Since you have come here, you have likely already encountered the challenge of how best to perform data analysis across multiple datasets. Finding the right way to do this is important if you want to benefit from the increased statistical power that a larger data resource provides. The availability of multiple, comparable datasets is also essential for assessing reproducibility and validating your findings.
So, why is the analysis of multiple datasets challenging? There are two main answers to this question: comparability and accessibility. Comparability means that the datasets need to be similar with regard to the investigated measures. But they also need to be comparable with regard to how these measures were processed, that is, how they were derived from the original raw data. As the name suggests, accessibility means that you somehow need to get “access” to the data in order to analyze it. This sounds easier than it often is in practice, because many datasets contain sensitive information and cannot be easily shared. Others may be too large to transfer or too difficult to exchange for other reasons.
These issues led to the development of a range of techniques that we refer to in this post as “federated” data analysis and federated learning. Let’s get started with an important question:
What are federated data analysis and federated learning?
Federated data analysis describes an analysis that is performed on multiple (often geographically) separated datasets. During this analysis, the data are not exchanged and can stay, for example, behind a given institution’s firewall. Only parameters of the analysis method are exchanged between the data-hosting sites, and these parameter sets must not reveal potentially sensitive information (more on this later).
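To make this concrete, here is a minimal sketch in Python of what "exchanging only parameters" can mean in the simplest case: computing a mean across sites. The site names and values below are made up for illustration; real systems add safeguards on top of this basic pattern.

```python
# Minimal illustration of a federated computation (hypothetical data).
# Each site keeps its raw values private and shares only aggregate
# parameters: the local sum and the local sample count.

def local_summary(values):
    """Computed behind each site's firewall; raw values never leave."""
    return sum(values), len(values)

def federated_mean(site_summaries):
    """Run by the coordinator, on the exchanged parameters only."""
    total = sum(s for s, _ in site_summaries)
    count = sum(n for _, n in site_summaries)
    return total / count

# Three hypothetical sites with local datasets
site_a = [4.0, 5.0, 6.0]
site_b = [2.0, 3.0]
site_c = [7.0]

summaries = [local_summary(s) for s in (site_a, site_b, site_c)]
print(federated_mean(summaries))  # 4.5, identical to the pooled mean
```

Note that the coordinator only ever sees three pairs of numbers, yet obtains exactly the value a pooled analysis would produce.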
There are several technological implementations of this concept, which we will cover in later posts, as they are not essential for understanding the pros and cons of the overall approach. It is important to note that federated data analysis comprises numerous types of analytics approaches, including classical statistics and machine learning. The latter in particular is frequently called federated learning.
What are the advantages of federated analysis?
The main advantage of federated analysis is the ability to access datasets that cannot be shared. If your federated analysis system is set up properly, there is no danger of unintentionally releasing sensitive information. This means that, with access to a large federated data resource, you can safely profit from a substantially increased sample size, the ability to validate your findings in independent cohorts, and the opportunity to test new hypotheses if additional data become available in the federated resource.
The ability to perform federated analyses is particularly useful in consortium projects that intend to analyze individual-level data, which is substantially more powerful than the exchange and meta-analytic integration of group-level statistics. Federated analysis can also simplify the logistics of an otherwise difficult exchange of data resources, and it facilitates analyses of data that are simply too large to share. Furthermore, federated analysis methods usually yield the same results as their non-federated versions, so there is no loss in the quality of the obtained solutions.
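The equivalence between federated and pooled solutions can be seen in a small sketch: for ordinary least squares with one predictor, each site only needs to share a handful of sufficient statistics, and combining them reproduces the pooled fit exactly. The two-site dataset below is invented for illustration.

```python
# Sketch: federated simple linear regression via sufficient statistics.
# Each site exchanges only five aggregate numbers, yet the fitted slope
# and intercept are identical to those of a pooled (non-federated) fit.

def local_stats(xs, ys):
    """Per-site sufficient statistics; no individual-level data leaves."""
    return (len(xs), sum(xs), sum(ys),
            sum(x * y for x, y in zip(xs, ys)),
            sum(x * x for x in xs))

def federated_ols(stats_per_site):
    """Combine the site statistics into the global OLS solution."""
    n = sum(s[0] for s in stats_per_site)
    sx = sum(s[1] for s in stats_per_site)
    sy = sum(s[2] for s in stats_per_site)
    sxy = sum(s[3] for s in stats_per_site)
    sxx = sum(s[4] for s in stats_per_site)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two hypothetical sites whose combined data follow y = 2x + 1 exactly
site_1 = ([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
site_2 = ([3.0, 4.0], [7.0, 9.0])

slope, intercept = federated_ols([local_stats(*site_1), local_stats(*site_2)])
print(slope, intercept)  # 2.0 1.0, matching the pooled closed-form fit
```

Because the closed-form OLS solution depends on the data only through these sums, splitting the computation across sites changes nothing about the result, only about where the raw data reside.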
What do you need to consider when implementing federated analysis?
Communication and computational complexity. Communication is essential for successful data analysis projects, and this applies especially to federated projects. Data need to be made comparable without the option of running a single processing script on the combined data resource. This means that participating institutions need to communicate, agree on data standards as well as the best processing strategy, and then apply these to the different datasets.
The second consideration is computational complexity. Federated analysis systems are more involved than centralized versions, and may seem daunting at first. Not all algorithms can be easily turned into federated applications, since the parameters they require may not be safe to exchange across the network. Think of a complex neural network, for example. Here, exchanging the large number of parameters may allow an attacker to reverse-engineer the data and gain access to sensitive information. Preventing such issues requires an understanding of the mathematical operations a given analysis method performs. It is therefore a good idea to start with a simpler analysis approach in order to understand how the parameter exchange works, or to simply use methods that are already implemented and have been shown to be safe. There are also technological developments (such as differential privacy) that can be deployed in conjunction with federated learning to provide another layer of safety for the parameter exchange.
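The core idea behind differential privacy can be illustrated with a toy sketch: before a parameter leaves a site, calibrated noise is added so that no single individual's contribution can be reliably inferred from the release. The Laplace mechanism below is standard, but the sensitivity and epsilon values are illustrative only, not a vetted privacy analysis.

```python
import math
import random

# Toy illustration of the differential-privacy idea. A site perturbs a
# parameter with Laplace noise before exchanging it; the noise scale is
# sensitivity / epsilon. The concrete values here are assumptions made
# for demonstration, not a privacy guarantee for any real analysis.

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse transform sampling."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(parameter, sensitivity, epsilon, rng):
    """Release a noisy version of a parameter instead of its true value."""
    return parameter + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
true_mean = 4.5          # the exact local parameter a site would share
noisy_mean = privatize(true_mean, sensitivity=1.0, epsilon=1.0, rng=rng)
print(noisy_mean)        # a perturbed value near 4.5
```

Smaller epsilon values inject more noise and therefore more protection, at the cost of accuracy; choosing these parameters well is exactly the kind of question that requires understanding the method's mathematics.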
Another consideration, which is becoming increasingly relevant as data resources grow, is computational speed. Since federated analyses require the repeated exchange of analysis parameters across the network, they can be relatively slow when applied to very large data resources.
What you need to get started
Look out for technological solutions for federated analyses that suit your needs. For example, for R-based analyses, DataSHIELD (see below) provides an infrastructure for federated analysis.
Get an overview of available data resources. When data are federated, it becomes particularly challenging for all involved analysts to retain a good overview of the available data infrastructure. On foldercase you can record datasets and the information (e.g. feature names) necessary for defining analysis strategies. You can allow your colleagues to see this information, so you can advance efficiently with your federated analysis project. Check out our blog post on how to organize your research and learn how to run data-focused collaborations on foldercase.
Communicate. Agree with your collaborators or data providers on data standards, processing strategies and analysis pipelines. On foldercase, you can manage this communication process, provide the necessary information to your colleagues, and record data that have been processed in a harmonized way.
Agree with your colleagues on data governance. How do you agree on the nature of the analyses performed on the distributed data resource?
Understand the analysis tools that you want to deploy on the federated system. Which parameters are exchanged? Has this method been shown to protect sensitive information?
Use simulated data first. To understand the functionality of the method and the nature of the parameter exchange, it is a good idea to start your analysis with simulated data. This will protect sensitive information in case something goes wrong in the beginning.
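Such a dry run can be sketched in a few lines of Python. Because the simulation's ground truth is known, you can verify the federated pipeline end to end before any sensitive data are involved; the site names and sizes below are made up.

```python
import random

# Dry run of a federated pipeline on simulated data. The true mean is
# known by construction, so the federated estimate can be checked
# against it before the same workflow touches real, sensitive data.

rng = random.Random(0)
TRUE_MEAN = 10.0

# Simulate three sites drawing from the same known distribution
sites = {name: [rng.gauss(TRUE_MEAN, 2.0) for _ in range(n)]
         for name, n in [("site_a", 500), ("site_b", 300), ("site_c", 200)]}

# Federated step: each site releases only its (sum, count) pair
summaries = [(sum(v), len(v)) for v in sites.values()]
estimate = sum(s for s, _ in summaries) / sum(n for _, n in summaries)

# Validate against the known ground truth of the simulation
print(round(estimate, 2))  # close to 10.0
assert abs(estimate - TRUE_MEAN) < 0.5
```

If something in the parameter exchange were wired up incorrectly, the check at the end would fail on harmless synthetic numbers rather than leaking or misanalyzing real data.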
Example of a federated analysis system
DataSHIELD. DataSHIELD is open-source software for federated analysis using the statistical programming language R. It uses an Opal database that is deployed behind a given institution's firewall and holds the data used for federated analysis. Analysis parameters are exchanged between the data-holding sites via secure web services.
Ready to analyze?
Federated analysis is a powerful tool for the safe analysis of individual-level data that cannot be combined into a single storage system. Considering the aspects mentioned above will help you get started in the collaborative process of federated data analytics.