Machine Learning for X and X for Machine Learning

Machine Learning for X and X for Machine Learning
June 15, 2022
-
Dr. Zoi Kaoudi

After the recent advances in Artificial Intelligence (AI), and especially in Machine Learning (ML) and Deep Learning (DL)*, various other computer science fields have gone into a race of "blending" their existing methods with ML/DL. There are two directions to enable such a blend: Either using ML to advance the field, or using the methods developed in the field to improve ML. A commonly used slogan when combining ML with a computer science field is: ML for X or X for ML, where X can be, for instance, any of {databases, systems, reasoning, semantic web}.

In this blog post, we focus on cases where X = big data management. We have already observed works on ML for data management and on data management for ML since several years now. Both directions have a great impact in both academia, with dedicated new conferences popping up, as well as in the industry, with several companies working on either improving their technology with ML or providing scalable and efficient solutions for ML.

Databloom.ai is one of the first companies to embrace both directions within a single product. Within Databloom's product, Blossom, we utilize ML to improve the optimizer and, thus, provide better performance to users but also we utilize well-known big data management techniques to speed up federated learning. Databloom.ai is the first to embrace both directions within a single product.

Machine Learning for Big Data Management and Big Data Management for Machine Learning in databloom.ai
Machine Learning for Big Data Management and Big Data Management for Machine Learning in databloom.ai

ML for Data Management

In a previous blog post we discussed how we plan to incorporate ML into Blossom's optimizer. Having an ML model to predict the runtime of an execution plan can have several benefits. First, system performance can rocket as the optimizer is able to find very efficient plans. For example, for k-means, an ML-based optimizer [1] outputs very efficient plans by finding the right platform combination and achieve 7x better runtime performance than a highly-tuned cost-based optimizer! Second, the hard part of manually tuning a cost model in an optimizer is vanished. With just collected training data (plans and their runtime) you can easily train an ML model.


Data Management for ML

In an earlier blog post we discussed how Blossom, a federated data lakehouse analytics framework, can provide an efficient solution for federated learning following principles from data management. To enable federating learning, we are working towards a parameter server architecture. However, as simplistic parameter servers are very inefficient due to excessive network communication (i.e., large number of messages sent through the network) we utilize data management techniques such as increasing locality, enabling latency hiding, and exploiting bulk messaging. The database community has already exploited such optimizations and showed that they can lead to more than an order of magnitude faster training times.

References

[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.
[2] Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl: NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. SIGMOD Conference 2022: 481-495.

* We use the term ML to refer to both Machine Learning and Deep Learning.
** Icons appearing in the figure were created by Creative Stall Premium - Flaticon

About DataBloom

Blossom Sky is all about taking data collaboration and efficiency to the next level. Our platform tackles the big challenge of data silos, bringing everything together in one easy-to-use system. It's built to work smoothly with a whole range of AI algorithms and models.

The cool part? Blossom Sky works hand-in-hand with top data frameworks like Databricks, Snowflake, Cloudera, and others, including Hadoop, Teradata, and Oracle. Plus, it's fully compatible with AI favorites like TensorFlow, Pandas, and PyTorch. We've made sure it fits right into your existing setup.

Blossom Sky is the commercial version of Apache Wayang, and we're proud to offer it as Open Source Software. You can check out our public GitHub repo right here. If you're enjoying our software, we'd love your support - a star ⭐ would mean a lot to us!

If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.
back to all articlesFollow us on Google News
Ready to Elevate Your Data Experience? Get a quote today!