Recently, Paul King (V.P. and Chair of Groovy PMC) highlighted the big data abstraction that Apache Wayang provides. He mainly showed that users specify an application in a logical plan (a Wayang plan) that is platform agnostic: Apache Wayang, in turn, transforms a logical plan into a set of execution (physical) operators to be executed by specific underlying processing platforms, such as Apache Flink and Apache Spark.
In this post, we elaborate on the cross-platform optimizer that comes with Apache Wayang, which decides how to generate execution plans. When a user specifies an application as a so-called Wayang plan, Apache Wayang runs an optimization process that decides the right execution platform (e.g., Apache Flink) for each operator in the Wayang plan so that the overall execution time (or monetary cost) is reduced. All this without users noticing it!
Cross-Platform Data Processing
Today’s data analytics often need to perform tasks on more than one data processing platform; that is, they are cross-platform analytics. We have identified four situations in which an application requires support for cross-platform data processing:
- Platform independence. Applications run an entire task on a single platform but may require switching platforms for different input datasets or tasks, usually with the goal of achieving better performance. Paul King has highlighted this case in his blog post.
- Opportunistic cross-platform. Applications might benefit performance-wise from using multiple platforms to run one single task. We will highlight this case in this post.
- Mandatory cross-platform. Applications may require multiple processing platforms because the platform where the input data resides, e.g., PostgreSQL, cannot perform an incoming task, e.g., a machine learning task. Thus, data should be moved from the platform in which it resides to another platform to be able to run the incoming task.
- Polystore. Applications may require multiple processing platforms because the input data is spread across several data stores, e.g., in a data lake setting.
The current practice to cope with cross-platform requirements is either to build specialized systems that inherently combine two or more platforms, or to write ad-hoc programs that glue different platforms together. The first approach results in being tied to specific platforms, which can either become outdated or be outperformed by newer ones. Re-implementing such specialized systems to incorporate newer platforms is very often prohibitively time-consuming. Although the second approach is not coupled with specific platforms, it is expensive, error-prone, and requires expertise on different platforms to achieve high efficiency.
Apache Wayang: a Systematic Solution for Cross-Platform Data Processing
The research and industry communities have identified the need for a systematic solution that decouples applications from the underlying processing platforms and enables efficient cross-platform data processing, transparently from applications and users.
The ultimate goal would be to replicate the success of DBMSs for cross-platform applications: Users formulate platform-agnostic data analytic tasks, and an intermediate system decides on which platforms to execute each (sub)task with the goal of minimizing cost (e.g., runtime or monetary cost).
The key component of Apache Wayang to realize this is its cross-platform optimizer. More concretely, Wayang’s optimizer tackles the problem of finding an execution plan able to run across multiple platforms that minimizes the execution cost of a given task. Let us explain the cross-platform optimization of Apache Wayang via a running example.
Figure 1 shows a Wayang plan for the stochastic gradient descent (SGD) algorithm when the initial data is stored in a database. In more detail, the input data points are read via a TableSource and filtered via a Filter operator. Then, they are (i) stored into a file for visualization using a CollectionSink and (ii) parsed using a Map, while the initial weights are read via a CollectionSource. The main operations of SGD (i.e., sampling, computing the gradients of the sampled data point(s), and updating the weights) are repeated until convergence (i.e., the termination condition of RepeatLoop). The resulting weights are output in a collection.
Given this input plan, the cross-platform optimizer passes the Wayang plan through four phases: plan inflation, operator cost estimation, movement cost estimation, and plan enumeration.
Figure 2 depicts the workflow of Wayang’s optimizer. At first, given a Wayang plan, the optimizer passes the plan through the plan inflation phase, where it inflates the input plan by applying a set of mappings to actual execution operators. In other words, these mappings list how each of the platform-agnostic Wayang operators can be implemented on the different platforms with execution operators. The resulting inflated Wayang plan thus contains all its execution alternatives. The optimizer then annotates the inflated plan with estimates for both data cardinalities and the costs of executing each execution operator. Next, it takes a graph-based approach to determine how data can be moved most efficiently among different platforms and annotates the inflated plan accordingly. It then uses all these annotations to determine the optimal execution plan via an enumeration algorithm. Finally, the resulting execution plan can be enacted by the executor of Apache Wayang on all the selected processing platforms.
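To make these phases more tangible, here is a minimal Python sketch of the idea. It is not Wayang's actual implementation: the `MAPPINGS` and `MOVE_COST` tables, all cost numbers, the `Sample` operator name, and the helper functions are assumptions made up for illustration. It also treats the plan as a simple chain of operators and enumerates every alternative exhaustively, which is a simplification of what a real optimizer would do on a plan with branches and loops.

```python
from itertools import product

# Hypothetical mappings: for each platform-agnostic operator, the platforms
# that can implement it and a made-up cost estimate for each alternative.
MAPPINGS = {
    "TableSource": {"postgres": 2.0},
    "Filter": {"postgres": 1.0, "spark": 4.0, "java": 9.0},
    "Map": {"spark": 3.0, "java": 8.0},
    "Sample": {"spark": 6.0, "java": 1.0},
}

# Hypothetical costs of moving data from one platform to another.
MOVE_COST = {
    ("postgres", "spark"): 2.0,
    ("postgres", "java"): 5.0,
    ("spark", "java"): 1.0,
    ("java", "spark"): 1.5,
}


def movement_cost(src, dst):
    """Cost of handing data from platform src to platform dst."""
    if src == dst:
        return 0.0
    # Unknown conversions are treated as impossible (infinite cost).
    return MOVE_COST.get((src, dst), float("inf"))


def inflate(plan):
    """Plan inflation: attach all execution alternatives to each operator."""
    return [(op, MAPPINGS[op]) for op in plan]


def enumerate_plans(inflated):
    """Try every platform assignment and keep the cheapest one."""
    best_cost, best_plan = float("inf"), None
    alternatives = [list(alts.items()) for _, alts in inflated]
    for choice in product(*alternatives):
        platforms = [platform for platform, _ in choice]
        # Operator costs plus movement costs between consecutive operators.
        cost = sum(op_cost for _, op_cost in choice)
        cost += sum(movement_cost(a, b) for a, b in zip(platforms, platforms[1:]))
        if cost < best_cost:
            best_cost, best_plan = cost, platforms
    return best_plan, best_cost


plan = ["TableSource", "Filter", "Map", "Sample"]
best_plan, best_cost = enumerate_plans(inflate(plan))
print(list(zip(plan, best_plan)), best_cost)
```

With these made-up numbers, the cheapest assignment keeps the source and the filter on Postgres, runs the map on Spark, and samples on Java. The point of the sketch is only that the choice falls out of operator costs and movement costs together, not out of any single operator's cost.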
For example, Wayang’s optimizer outputs the execution plan illustrated in Figure 3 for our SGD example in Figure 1.
The above plan shows the execution plan for the SGD Wayang plan when Postgres, Spark, and JavaStreams are the only available platforms. This plan exploits Postgres to extract the desired data points, Spark’s high parallelism for the large input dataset, and at the same time, the low latency of JavaStreams for the small collection of weights. Also note the three additional execution operators: two for data movement (Results2Stream and Broadcast) and one to make data reusable (Cache).
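One way to picture how such data movement operators get picked is as a search over a graph whose nodes are data channels and whose edges are conversion operators. The following Python sketch is an illustration under stated assumptions, not Wayang's algorithm: the channel names, the `Collect`, `Parallelize`, and `CollectRDD` operators, and all costs are hypothetical, and only Results2Stream, Broadcast, and Cache are names taken from the plan above. It also handles only a single consumer, for which a plain shortest-path search (Dijkstra) is enough. Serving several consumers at once is a harder problem that this sketch does not attempt.

```python
import heapq

# Hypothetical channel conversion graph. Each entry maps a source channel to
# a list of (target channel, conversion operator, made-up cost).
CONVERSIONS = {
    "PostgresRelation": [("JavaStream", "Results2Stream", 2.0)],
    "JavaStream": [("JavaCollection", "Collect", 1.0)],
    "JavaCollection": [
        ("SparkBroadcast", "Broadcast", 1.5),
        ("SparkRDD", "Parallelize", 3.0),
    ],
    "SparkRDD": [
        ("SparkCachedRDD", "Cache", 0.5),
        ("JavaCollection", "CollectRDD", 4.0),
    ],
}


def cheapest_conversion(source, target):
    """Dijkstra over the conversion graph.

    Returns (total cost, list of conversion operators), or (inf, []) when
    the target channel cannot be reached from the source channel.
    """
    queue = [(0.0, source, [])]
    settled = set()
    while queue:
        cost, channel, ops = heapq.heappop(queue)
        if channel == target:
            return cost, ops
        if channel in settled:
            continue
        settled.add(channel)
        for nxt, op, op_cost in CONVERSIONS.get(channel, []):
            if nxt not in settled:
                heapq.heappush(queue, (cost + op_cost, nxt, ops + [op]))
    return float("inf"), []


cost, ops = cheapest_conversion("PostgresRelation", "SparkBroadcast")
print(cost, ops)
```

In this toy graph, getting data from a Postgres relation into a Spark broadcast goes through a Java stream and a Java collection first. The resulting cost is what an optimizer would add to the plan as the movement cost between the two operators involved.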
What Is the Benefit?
You may be wondering what the advantage of these hybrid execution plans is in practice.
We observe that the cross-platform optimizer allows Apache Wayang to run the SGD task more than one order of magnitude faster than any single-platform execution (Apache Spark, Apache Flink, or stand-alone Java): Apache Wayang executes the SGD task in a few seconds, while each single platform takes on the order of minutes!
What Do Apache Wayang’s Users Have to Do?
Actually, Wayang’s users have nothing to do besides declaring their available processing platforms. For example, in the code snippet from Paul’s blog post, users simply have to enable all the platform plugins instead of selecting only one: that is, both the .withPlugin(Java.basicPlugin()) and .withPlugin(Spark.basicPlugin()) lines must be active. Users can also add any other available processing platform they might have.
Users simply specify their tasks in Apache Wayang in a platform-agnostic manner and let Wayang do the rest for them to achieve the best performance!
Apache Wayang at the Core of DataBloom AI's Virtual Data Lakehouse, Blossom Sky
Blossom Sky has Apache Wayang at its core and extends it with new features that Wayang does not have today, such as a powerful ML-based query optimizer, federated learning, data debugging, and a compliant SQL optimizer.
The cool part? Blossom Sky works hand-in-hand with top data frameworks like Databricks, Snowflake, Cloudera, and others, including Hadoop, Teradata, and Oracle. Plus, it's fully compatible with AI favorites like TensorFlow, Pandas, and PyTorch. We've made sure it fits right into your existing setup.
Blossom Sky is the commercial version of Apache Wayang, and we're proud to offer it as Open Source Software. You can check out our public GitHub repo. If you're enjoying our software, we'd love your support - a star ⭐ would mean a lot to us!
If you need professional support from our team of industry-leading experts, you can always reach out to us via Slack or Email.
References
Wayang with Groovy: https://blogs.apache.org/groovy/entry/using-groovy-with-apache-wayang
Apache Wayang: https://wayang.apache.org/
Sebastian Kruse, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Sanjay Chawla, Felix Naumann, Bertty Contreras-Rojas: Optimizing Cross-Platform Data Movement. ICDE 2019: 1642-1645