Machine Learning (ML) is a branch of Artificial Intelligence (AI). The main idea of ML is to enable systems to learn from historical data so that they can predict output values for new input events. The beauty is that ML does not require systems to be explicitly programmed to achieve this, and it needs little human intervention.
With the growing volumes of data in today’s world, ML has gained unprecedented popularity. We can achieve today what was unimaginable yesterday, from predicting cancer risk from mammograms to polyglot AI translators. As a result, ML has become a key competitive differentiator for many companies, and ML-powered software has quickly become omnipresent in our lives. The key to ML is that, in general, the more data is available, the better the accuracy of the predictive models.
The Appearance of Distributed ML
While ML has become a powerful technology, its hunger for training data makes it hard to build ML models on a single machine. It is not unusual to see training data sizes in the order of hundreds of gigabytes to terabytes, such as in the Earth Observation domain. This has created the need to build ML models over data that is distributed across multiple storage nodes.
Distributed ML aims at learning ML models using multiple compute nodes, both to cope with larger training data sizes and to improve performance and model accuracy. Thus, distributed ML helps organizations and individuals draw meaningful conclusions from vast amounts of training data. Healthcare and advertising are just two of the sectors that benefit greatly from distributed ML.
There are two fundamental ways to perform distributed ML: data parallelism and model parallelism. Figure 1 illustrates these two approaches.
In the data parallelism approach, the system horizontally partitions the input training data, usually into as many partitions as there are compute nodes (workers), and distributes each data partition to a different worker. Then, it sends the same model features to each worker, which, in turn, learns a local model using its data partition as input. The workers then send their local models to a central place, where the system merges them into a single global model.
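The data parallelism steps above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the model is a single weight `w` in `y = w * x`, each worker fits it by least squares, and the merge step is a size-weighted average. The function names (`partition`, `train_local`, `merge`) are hypothetical and not taken from any particular framework.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, label y)


def partition(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Horizontally split the rows into one partition per worker (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]


def train_local(part: List[Sample]) -> float:
    """Fit y = w * x on one partition by least squares and return the local w."""
    num = sum(x * y for x, y in part)
    den = sum(x * x for x, _ in part)
    return num / den


def merge(local_models: List[float], sizes: List[int]) -> float:
    """Merge local weights into a global weight, weighted by partition size."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(local_models, sizes)) / total


# Toy training data generated from y = 2 * x (an assumption for illustration).
data = [(float(x), 2.0 * x) for x in range(1, 9)]

parts = partition(data, num_workers=4)           # one partition per worker
local_models = [train_local(p) for p in parts]   # each worker trains locally
global_model = merge(local_models, [len(p) for p in parts])
print(global_model)
```

In a real deployment each call to `train_local` would run on a different machine; here the workers are simulated sequentially to keep the example self-contained.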
The model parallelism approach, in contrast to data parallelism, partitions the model features and sends each model partition to a different worker, which in turn builds a local model using the same input data. That is, the entire input training data is replicated for all workers. Then, the system brings these local models into a centralized place to aggregate them into a single global model.
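As a rough sketch of the model parallelism idea, the example below splits the weights of a small linear model across workers. Every worker sees the entire training data but only updates the weights it owns, and the system then combines the pieces into one global model. The iterative gradient-descent scheme, the learning rate, and the names `partition_features`, `worker_step`, and `train` are assumptions chosen for illustration, not a description of a specific system.

```python
from typing import Dict, List, Tuple

Row = Tuple[List[float], float]  # (feature vector, label)


def partition_features(num_features: int, num_workers: int) -> List[List[int]]:
    """Assign each model weight (feature index) to exactly one worker."""
    return [list(range(w, num_features, num_workers)) for w in range(num_workers)]


def worker_step(owned: List[int], weights: List[float],
                data: List[Row], lr: float) -> Dict[int, float]:
    """Update only the weights this worker owns, using the full data set."""
    grads = {j: 0.0 for j in owned}
    for x, y in data:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j in owned:
            grads[j] += err * x[j]
    n = len(data)
    return {j: weights[j] - lr * grads[j] / n for j in owned}


def train(data: List[Row], num_features: int, num_workers: int,
          rounds: int = 200, lr: float = 0.5) -> List[float]:
    weights = [0.0] * num_features
    blocks = partition_features(num_features, num_workers)
    for _ in range(rounds):
        # Each worker computes its model partition from the same global state.
        updates = [worker_step(b, weights, data, lr) for b in blocks]
        # Aggregate the partitions into a single global model.
        for update in updates:
            for j, value in update.items():
                weights[j] = value
    return weights


# Toy data generated from y = 3 * x0 - 1 * x1 (an assumption for illustration).
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 2.0), ([2.0, 1.0], 5.0)]
global_model = train(data, num_features=2, num_workers=2)
print(global_model)
```

Note that the full data set is passed to every worker, which mirrors the replication of the training data described above.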
Yet, although powerful, distributed ML has a core assumption that limits its applicability: one needs to have control and access over the entire training data.
However, in an increasing number of cases, for example in the healthcare domain, one cannot have direct access to the raw data, and hence distributed ML cannot be applied.
The Emergence of Federated Learning
The concept of federated learning (FL) was first introduced by Google in 2017. Yet, the concept of federated analytics/databases dates back to the 1980s. Similar to federated databases, FL aims to bring computation to where the data is.
Federated learning (FL) is basically a distributed ML approach, but, in contrast to traditional distributed ML, raw data from different workers is never moved out of the workers. The workers own the data, and they are the only ones with control over and direct access to it. Generally speaking, FL allows for gaining experience with a more diverse set of datasets at different independent/autonomous locations.
Ensuring data privacy is crucial in today’s world, where data privacy is becoming one of society’s main concerns. For example, many governments have enacted laws, such as GDPR and CCPA, to control the way data is stored and processed. FL enables organizations and individuals to train ML models across multiple autonomous parties without compromising data privacy. During training, organizations and individuals share their local models to learn from each other’s local models. Thus, they can leverage others’ data to learn more robust ML models than when using their own data alone.
The beauty of FL is that it enables organizations and individuals to collaborate towards a common goal without sacrificing data privacy.
Multiple participants collaboratively train a model with their sensitive data and exchange only the learned local models among themselves. Figure 2 illustrates the general architecture of federated learning (FL).
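The following Python sketch illustrates this exchange under simplifying assumptions: each participant holds private samples for a one-weight model `y = w * x`, runs a few local gradient steps, and returns only its updated weight, which a coordinator averages by sample count. The `Participant` class, the `federated_round` function, and the averaging rule are hypothetical choices for illustration, not the API of any FL framework.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, label y)


class Participant:
    """Holds private data; only model weights ever leave this object."""

    def __init__(self, private_data: List[Sample]) -> None:
        self._data = private_data  # raw data stays local

    def num_samples(self) -> int:
        return len(self._data)

    def local_update(self, w: float, lr: float = 0.05, epochs: int = 5) -> float:
        """Run a few gradient steps on the local data, starting from the global w."""
        for _ in range(epochs):
            grad = sum((w * x - y) * x for x, y in self._data) / len(self._data)
            w -= lr * grad
        return w


def federated_round(global_w: float, participants: List[Participant]) -> float:
    """Collect local models and average them, weighted by local sample count."""
    updates = [(p.local_update(global_w), p.num_samples()) for p in participants]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Three participants with private toy data drawn from y = 2 * x (an assumption).
participants = [
    Participant([(1.0, 2.0), (2.0, 4.0)]),
    Participant([(3.0, 6.0)]),
    Participant([(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)]),
]

global_w = 0.0
for _ in range(50):
    global_w = federated_round(global_w, participants)
print(global_w)
```

The coordinator in `federated_round` only ever sees weights and sample counts, never the samples themselves, which is the property that distinguishes this setup from the distributed ML approaches above.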
FL also leverages the two fundamental execution modes to build models across multiple participants: horizontal learning (data parallelism) and vertical learning (model parallelism).
FL is a powerful technology that allows organizations and individuals to collaborate towards the same goal without sacrificing data privacy.
Old Wine in a New Bottle?
We can at least conclude that FL is a mixture of federated databases and distributed ML.
Yet, there are a few aspects that make FL unique:
- In contrast to distributed ML, FL prevents organizations and individuals from accessing data that belongs to other organizations or individuals.
- FL is geo-distributed in essence, while distributed ML is an on-premises technology.
- One of the main goals of FL is safeguarding data privacy, which is only a nice-to-have feature in federated databases. Distributed ML does not address data privacy because it assumes full control of the data.
- While federated databases assume a relational data model, FL does not make any assumptions about the underlying data model.
 Alon Y. Halevy, Peter Norvig, Fernando Pereira: The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 24(2): 8-12 (2009).
 Diego Peteiro-Barral, Bertha Guijarro-Berdiñas: A survey of methods for distributed machine learning. Prog. Artif. Intell. 2(1): 1-11 (2013).
 Brendan McMahan, Daniel Ramage: Federated Learning: Collaborative Machine Learning without Centralized Training Data. Google AI Blog. April 6, 2017.
 Dennis Heimbigner, Dennis McLeod: A Federated Architecture for Information Management. ACM Trans. Inf. Syst. 3(3): 253-278 (1985).
 General Data Protection Regulation (GDPR): https://gdpr-info.eu/
 California Consumer Privacy Act (CCPA): https://oag.ca.gov/privacy/ccpa