Machine Learning (ML) has not only become omnipresent in our everyday lives (with self-driving cars, digital personal assistants, chatbots, etc.) but has also started spreading to our core technological systems, such as databases and operating systems. In the area of databases, a large body of work aims at optimizing data management components, from index building and knob tuning to query optimization. In query optimization alone, ML is used in place of many optimizer components, such as cardinality estimation, cost modeling, and join enumeration. In this blog post, we focus on using an ML model in place of a cost model and go from traditional cost-based query optimization to the newly proposed ML-based query optimization.
ML-based query optimization
Generally speaking, given a user query, a query optimizer finds the best way to execute this query (i.e., an execution plan) so that its runtime is minimized. For example, Blossom Sky's query optimizer is responsible for figuring out which platform combination is best for executing a query so that the runtime of the given query is as low as possible. To do so, traditional cost-based optimizers use a cost model (a set of mathematical formulas) that captures the cost of executing a plan and an enumeration algorithm that searches over different plans until it finds the one with the smallest cost. When going from a cost-based query optimizer to an ML-based one, the cost model is replaced with an ML model (usually a regression model) that outputs an estimate of the runtime of a plan [1]. The enumeration algorithm then simply invokes the ML model while searching over different plans to determine the one with the lowest estimated runtime.
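As a minimal, illustrative sketch (none of the class or function names below come from Blossom Sky or the paper), an enumerator backed by a learned runtime model could look like this:

```python
# Minimal sketch (not Blossom Sky or DataFarm code; all names are hypothetical):
# an enumerator that ranks candidate plans with a learned runtime model instead
# of a hand-crafted cost formula.
from sklearn.ensemble import RandomForestRegressor


class LearnedCostModel:
    """Wraps a regression model that maps plan features to an estimated runtime (in seconds)."""

    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100, random_state=42)

    def fit(self, plan_features, runtimes):
        # plan_features: 2D array of numeric plan encodings (e.g., operator counts,
        # estimated cardinalities, chosen platforms); runtimes: observed labels.
        self.model.fit(plan_features, runtimes)

    def estimate_runtime(self, plan_features):
        return self.model.predict([plan_features])[0]


def enumerate_best_plan(candidate_plans, featurize, cost_model):
    """Return the candidate plan with the lowest estimated runtime."""
    return min(candidate_plans,
               key=lambda plan: cost_model.estimate_runtime(featurize(plan)))
```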
The hurdle of training data collection
It is well known by now that ML models can only be as good as their training data. The same holds for ML-based query optimization:
" The effectiveness of an ML-based query optimizer highly depends on the quantity and quality of training data as well as the availability of valuable ground-truth labels. "
In this case, a training dataset consists of a large set of query plans together with their runtimes, which serve as labels. Training data collection is a bottleneck as it requires (i) gathering thousands of heterogeneous query plans and (ii) executing all of them to obtain their runtimes. The latter is a very time-consuming task, as it means executing not only a large number of plans but also suboptimal ones, which are typically slow. To get a better idea of the time required to obtain these labels, see the two figures below. Collecting labels for only 500 OLAP plans with input data of about 1TB on our cluster of four quad-core nodes takes almost 10 days, while executing 10,000 plans with just 1GB of data requires a bit more than 4 days. If we extrapolate this to 10,000 plans on 1TB of data, it would require more than 6 months!

Executing 10,000 plans on 1TB of data to collect their labels would require more than 6 months!
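Assuming the labeling cost grows roughly linearly with the number of plans, this estimate follows from a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope extrapolation from the measurements reported above.
days_for_500_plans_on_1tb = 10        # ~10 days to label 500 OLAP plans on ~1TB
target_plans = 10_000

days_per_plan = days_for_500_plans_on_1tb / 500           # ~0.02 days per plan
total_days = days_per_plan * target_plans                  # 200 days
print(f"{total_days:.0f} days ≈ {total_days / 30:.1f} months")  # ~6.7 months, i.e., more than 6 months
```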
Even if logs from previously executed queries are available, the plans in the logs are the ones the optimizer chose to execute and, thus, most of them are (near-)optimal. Training a model with only optimal plans would lead to a biased model.
DataFarm: A training data generator for ML-based query optimizers
DataFarm [2] is a framework for efficiently generating training data (query plans with their execution runtime).
DataFarm enables users to reduce the cost of getting labeled query workloads by 54× compared to standard manual approaches.
It is based on a data-driven, white-box approach. A user inputs a typically very small set of query plans (e.g., 10), and DataFarm augments it into thousands of plans, attaching a label (i.e., an estimated runtime) with an uncertainty value to each generated plan. The figure below shows an overview of DataFarm.

The Abstract Plan Generator learns patterns from the input query workload as Markov chains and generates new heterogeneous abstract plans (plans without specific UDF values), exploiting the distributions of the real operators. These plans follow the patterns present in the input workload. For example, if in the input workload a group-by operator precedes, in most cases, a sort operator, the same will hold in the generated plans. The abstract plans generated at this phase cannot be executed yet because their operators do not specify user-defined values, such as the join key or the selection predicate, nor the platform on which they have to run.
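As a rough illustration (not DataFarm's actual implementation: plans are reduced to flat operator sequences here, and the operator names are made up), learning and sampling such operator-transition patterns could look like this:

```python
# Illustrative sketch of the idea behind the Abstract Plan Generator: learn
# first-order operator-transition probabilities from the input plans and sample
# new abstract plans from them.
import random
from collections import defaultdict


def learn_transitions(input_plans):
    """input_plans: list of operator sequences, e.g. [['source', 'filter', 'groupby', 'sort', 'sink'], ...]."""
    counts = defaultdict(lambda: defaultdict(int))
    for plan in input_plans:
        for current_op, next_op in zip(plan, plan[1:]):
            counts[current_op][next_op] += 1
    # Normalize counts into transition probabilities per operator.
    return {op: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for op, nxts in counts.items()}


def sample_abstract_plan(transitions, start="source", end="sink", max_len=12):
    """Walk the learned Markov chain to produce one new abstract plan."""
    plan, op = [start], start
    while op != end and len(plan) < max_len:
        next_ops, probs = zip(*transitions[op].items())
        op = random.choices(next_ops, weights=probs)[0]
        plan.append(op)
    return plan
```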
The Plan Instantiator then receives the generated abstract plans and creates an augmented set of executable plans by instantiating different variants of each abstract plan, e.g., by setting different selection predicates. To do so, it exploits metadata about the user's input data. This is crucial so that the generated plans are meaningful and can actually be executed without exceptions or empty results.
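A simplified sketch of this instantiation step, with a made-up metadata layout and table/column names, could look as follows:

```python
# Illustrative sketch: turn an abstract plan into several executable variants by
# filling in concrete values (selection predicates, group-by keys) drawn from
# metadata about the user's input data. All names here are hypothetical.
import itertools

table_metadata = {
    "orders": {
        "filter_columns": {"price": [10.0, 50.0, 100.0]},   # sampled predicate constants
        "groupby_columns": ["customer_id", "country"],
    }
}


def instantiate(abstract_plan, metadata, table="orders"):
    """Yield executable variants of an abstract plan such as ['source', 'filter', 'groupby', 'sink']."""
    meta = metadata[table]
    filter_options = [(col, const)
                      for col, consts in meta["filter_columns"].items()
                      for const in consts]
    for (filter_col, const), groupby_col in itertools.product(filter_options, meta["groupby_columns"]):
        yield {
            "plan": abstract_plan,
            "filter": f"{filter_col} < {const}",   # constant chosen from real data, so the result is non-empty
            "groupby_key": groupby_col,
        }


variants = list(instantiate(["source", "filter", "groupby", "sink"], table_metadata))
```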
Once we have the generated plans, the Label Forecaster uses an active learning approach to label the generated query workload efficiently. It executes only a few of the generated plans and forecasts the labels of the rest with an interpretable ML model based on a quantile regression forest. It iteratively exploits the uncertainty of the model to select a small number of jobs to execute and outputs the forecasted labels for the non-executed ones along with their uncertainty values. Downstream operations can then leverage these uncertainty values to improve their output, e.g., by using them as a noise indicator.
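The sketch below shows the shape of such an active-learning loop. Note that it is only an approximation of the Label Forecaster: DataFarm uses a quantile regression forest, whereas here the spread of per-tree predictions of an ordinary random forest stands in for the uncertainty estimate, and execute_plan is a hypothetical callback that runs a plan and returns its measured runtime.

```python
# Illustrative active-learning loop: execute only the most uncertain plans and
# forecast the labels of the rest, returning uncertainty values alongside them.
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_forest(features, labeled, labels):
    """Fit a forest on the executed plans and return it plus a per-plan uncertainty score."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    idx = sorted(labeled)
    model.fit(features[idx], [labels[i] for i in idx])
    per_tree = np.stack([tree.predict(features) for tree in model.estimators_])
    return model, per_tree.std(axis=0)   # spread across trees ~ prediction uncertainty


def active_label_loop(features, execute_plan, iterations=5, batch=20, seed=20):
    """features: numpy array of plan encodings; execute_plan(i) runs plan i and returns its runtime."""
    rng = np.random.default_rng(0)
    labeled = set(rng.choice(len(features), size=seed, replace=False).tolist())
    labels = {i: execute_plan(i) for i in labeled}          # expensive: real executions

    for _ in range(iterations):
        model, uncertainty = fit_forest(features, labeled, labels)
        # Execute the most uncertain, not-yet-labeled plans next.
        for i in [j for j in np.argsort(-uncertainty) if j not in labeled][:batch]:
            labels[int(i)] = execute_plan(int(i))
            labeled.add(int(i))

    # Forecast labels (with uncertainty) for every plan that was never executed.
    model, uncertainty = fit_forest(features, labeled, labels)
    forecasts = {i: (float(model.predict(features[[i]])[0]), float(uncertainty[i]))
                 for i in range(len(features)) if i not in labeled}
    return labels, forecasts
```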
Human-guided training data generation
As users often know their desired query workload, we have introduced humans into the active-learning step of the Label Forecaster. This increases the quality of the training data and, thus, of the downstream ML model. The human gives the Label Forecaster insights into which jobs are better to execute in order to obtain better-estimated labels. As humans cannot manually select among thousands of plans on their own, the Label Forecaster iteratively suggests to the user a small set of candidate plans for execution. The user can then inspect these candidates and remove them or add new plans via an intuitive graphical user interface (GUI). The GUI of DataFarm provides useful insights, such as feature importance and model explanation analysis, so that users can make informed decisions [3].
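Conceptually, the human review slots into the active-learning loop above as an extra step; the sketch below abstracts the GUI interaction into a hypothetical callback returning the user's removals and additions.

```python
# Illustrative sketch only: a human review step inserted into the active-learning loop.
def human_adjusted_batch(suggested_plans, review_callback):
    """suggested_plans: plan indices proposed by the Label Forecaster for execution.
    review_callback: hypothetical GUI hook returning (to_remove, to_add) chosen by the user."""
    to_remove, to_add = review_callback(suggested_plans)
    kept = [i for i in suggested_plans if i not in set(to_remove)]
    return kept + [i for i in to_add if i not in kept]
```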
Results
To evaluate the quality of DataFarm's training data, we generated 2,000 plans from only 6 initial queries. We then constructed four training sets that differ in how the labels were obtained: the first training set contains ground-truth labels obtained by executing all jobs, while the other three contain labels acquired by sampling 166 jobs to execute and using the ML model of the Label Forecaster to predict the labels of the rest. We used three different sampling mechanisms: (i) random, (ii) agglomerative clustering (DataFarm without the human), and (iii) manually modifying the set suggested by the agglomerative clustering (DataFarm with the human). Based on these four training sets, we built four ML models and used each one to predict the runtimes of our input query workload. The results are shown below.

We observe that the quality of training data generated by DataFarm is as good as the ground truth (green and blue bars, respectively). Most importantly, we observe that
DataFarm with human interaction outperforms the fully automated DataFarm as well as the model trained on ground-truth labels.
This result is possible because the user can determine the essential features for the ML model and make modifications according to her observations (adding and removing jobs).
References
[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.
[2] Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Expand your Training Limits! Generating Training Data for ML-based Data Management. SIGMOD Conference 2021: 1865-1878.
[3] Robin P. van de Water, Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Farm Your ML-based Query Optimizer's Food! Human-Guided Training Data Generation. CIDR 2022 (abstract).
[4] Robin P. van de Water, Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Farming Your ML-based Query Optimizer's Food. ICDE 2022 (demo), to appear.
About DataBloom
Blossom Sky works hand-in-hand with top data frameworks like Databricks, Snowflake, and Cloudera, as well as others including Hadoop, Teradata, and Oracle. Plus, it's fully compatible with AI favorites like TensorFlow, Pandas, and PyTorch. We've made sure it fits right into your existing setup.
Blossom Sky is the commercial version of Apache Wayang, and we're proud to offer it as Open Source Software. You can check out our public GitHub repo right here. If you're enjoying our software, we'd love your support - a star ⭐ would mean a lot to us!
If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.