As the field of big data processing continues to grow, so too do the challenges of debugging complex, distributed data processing systems. Traditional debugging approaches, designed for "small data" applications, are often inadequate for the scale and complexity of big data systems. As a result, programmers of big data applications face a range of unique challenges that require new approaches to debugging.
The Changing Face of Debugging
Frameworks like Apache Spark, Apache Flink, and Apache Wayang (Incubating) have made big data processing much easier. However, big data debugging is still in its infancy. Suppose an analytic task on a terabyte of data fails to produce the expected results. There are two common, but ineffective, approaches to debugging such a task:
- Traditional debugging tools: The first approach brings the tools developed for "small data" debugging to big data. One could attach a debugger to a remote Spark process and try the traditional mechanisms, such as setting watchpoints, pausing the Spark runtime, and stepping through the code line by line. This approach is expensive because it pauses the entire Spark runtime. Furthermore, due to the sheer size of the data, one cannot simply step through the code and watch the intermediate results for each tuple. Doing so is extremely time-consuming.
- Local machine evaluation: The second approach tries to evaluate the task on a local machine over a sample of the input dataset. It rests on the observation that erroneous outputs are typically triggered by a small fraction of the data. Therefore, one could take a sample of the input dataset and evaluate it on a local machine. If the sample does not trigger the issue, try a larger sample, and so on. Eventually, the sample becomes too large to hold on a single machine or to inspect with traditional debugging techniques, so this approach is doomed to fail too.
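The sample-and-retry loop from the second approach can be sketched in plain Python. This is only an illustration under stated assumptions: `run_task`, `is_buggy`, `debug_by_sampling`, and the doubling schedule are hypothetical names and choices made for this example, and an in-memory list stands in for the real input dataset.

```python
import random

def run_task(records):
    """Hypothetical analytic task: average of the 'value' field."""
    return sum(r["value"] for r in records) / len(records)

def is_buggy(result):
    """Hypothetical check: the average is expected to be non-negative."""
    return result < 0

def debug_by_sampling(dataset, start_size=10, max_local_size=1000, seed=0):
    """Double the sample size until the bug shows up or the sample
    exceeds what we assume fits on the local machine.
    Returns (sample_size, reproduced)."""
    rng = random.Random(seed)
    size = start_size
    while size <= max_local_size:
        sample = rng.sample(dataset, min(size, len(dataset)))
        if is_buggy(run_task(sample)):
            return size, True
        size *= 2
    # The next sample would not fit locally and the bug never appeared.
    return size, False
```

When the offending records are rare, the loop tends to reach `max_local_size` before a sample happens to contain one of them, which is the failure mode described above.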
We make the following three observations:
- Most bugs are caused by the interplay between code and data. Traditional debugging tools are designed for code debugging and not data debugging.
- Traditional debugging tools are not appropriate for distributed debugging. Typical data processing jobs involve hundreds of tasks that are run on dozens of workers generating a huge amount of intermediate data.
- Recent attempts at big data debugging are ad hoc, task-specific, and inflexible. There is a need for an abstraction that can address distributed debugging of code and data together while hiding the internal complexity of the system.
Data Debugging Challenges
One of the biggest challenges of big data debugging is the interplay between code and data. Most bugs are caused by complex interactions between code and the massive amounts of data being processed, yet traditional debugging tools are designed for code debugging and not data debugging. Furthermore, big data systems are inherently distributed, involving hundreds of tasks that are run on dozens of workers, generating a huge amount of intermediate data. Traditional debugging tools are not appropriate for distributed debugging, and recent attempts at big data debugging have been ad hoc, task-specific, and inflexible.
Data Debugging Modes
As already discussed, debugging big data jobs can be a challenging task, especially when dealing with a large-scale system. In the world of big data, two major modes of debugging are commonly used: online and post-hoc modes. In this post, we will discuss each of these modes in detail, and highlight their strengths and limitations.
Online mode. Online debugging happens while the main dataflow job is still running. Users can inspect intermediate results and debug by trial and error. Providing this kind of interactivity is challenging because popular data processing systems operate in batch mode. Pausing the dataflow job could pause the computation done by thousands of workers, which reduces throughput and wastes processing resources. Ideally, the online mode should:
- Allow a user to inspect intermediate results with or without pausing the dataflow execution.
- Provide a set of primitives so that a user can select intermediate data relevant for debugging programmatically.
Very few systems provide support for online big data debugging.
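To make the two online-mode requirements concrete, here is a minimal single-process sketch in Python. It is not how any particular system implements online debugging: `tap`, `parse`, and the sample input are hypothetical and introduced only for illustration. The idea is that a user-supplied predicate selects the intermediate tuples of interest, and they are copied into a bounded buffer while the rest of the pipeline keeps running.

```python
from collections import deque

def tap(records, predicate, buffer):
    """Pass every record downstream unchanged; copy matching ones into `buffer`."""
    for record in records:
        if predicate(record):
            buffer.append(record)  # a bounded deque drops its oldest entry when full
        yield record

def parse(lines):
    """Hypothetical intermediate stage: split 'key,value' lines."""
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)

# Keep at most 100 suspicious intermediate tuples without stopping the job.
suspicious = deque(maxlen=100)
lines = ["a,1", "b,-7", "c,3", "d,-2"]
pipeline = tap(parse(lines), lambda kv: kv[1] < 0, suspicious)
total = sum(value for _, value in pipeline)

print(total)             # -5
print(list(suspicious))  # [('b', -7), ('d', -2)]
```

The bounded buffer is a deliberate choice in this sketch: it caps the memory spent on debugging data, which matters when the job produces far more intermediate tuples than anyone could inspect.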
Post-hoc mode. This is the most common mode for big data debugging. Users instrument the main dataflow job to dump information into a log. One can then write another job (e.g., in Spark) to analyze the log and identify the issue. While common, this log-based approach is often not sufficient, because the logs lack a logical view of the job, such as which input records produced a given intermediate result or the eventual output (i.e., lineage). This information is often invaluable for effective debugging. Ideally, the post-hoc mode should:
- Give the user a logical view of the job without any extra effort, and
- Provide an easy way to express common post-hoc debugging scenarios.
As with online debugging, few systems provide extensive support for post-hoc debugging. Most of them support specific scenarios, such as lineage or task replay, and cannot be easily generalized to others.
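Lineage, the logical view mentioned above, can be illustrated with a small Python sketch. It assumes a toy in-memory pipeline, and every helper name here (`with_lineage`, `map_with_lineage`, `sum_by_key_with_lineage`) is hypothetical. Each record carries the set of input positions it was derived from, and an aggregation takes the union of those sets.

```python
from collections import defaultdict

def with_lineage(records):
    """Attach the set of contributing input positions to each record."""
    return [(record, frozenset([i])) for i, record in enumerate(records)]

def map_with_lineage(fn, tagged):
    """Apply fn to each record; lineage passes through unchanged."""
    return [(fn(record), ids) for record, ids in tagged]

def sum_by_key_with_lineage(tagged):
    """Sum values per key; an output's lineage is the union of its inputs' lineage."""
    totals = defaultdict(int)
    lineage = defaultdict(set)
    for (key, value), ids in tagged:
        totals[key] += value
        lineage[key] |= ids
    return {key: (totals[key], sorted(lineage[key])) for key in totals}

def parse(line):
    """Hypothetical parser for 'key,value' lines."""
    key, value = line.split(",")
    return key, int(value)

raw = ["eu,10", "us,5", "eu,-400", "us,7"]
result = sum_by_key_with_lineage(map_with_lineage(parse, with_lineage(raw)))

# Trace the suspicious 'eu' total back to the raw inputs that produced it.
culprits = [raw[i] for i in result["eu"][1]]
print(result["eu"])  # (-390, [0, 2])
print(culprits)      # ['eu,10', 'eu,-400']
```

A plain log of the final totals would show that the `eu` sum is wrong but not which inputs caused it; carrying lineage alongside the data is what makes that question answerable after the fact.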
We Need Native Big Data Debuggers
In conclusion, debugging big data processing systems is a complex and challenging task that requires careful consideration of the most popular debugging tasks and their corresponding requirements. As shown in Table 1, there are seven major categories of common debugging tasks, and very few systems can support all of them. The development cost of implementing task-specific variants can be significant, highlighting the need for primitives that are concise, flexible, and provide support for both monitoring and debugging. Moreover, for a data debugging system to be effective, it must provide holistic support for these debugging primitives, handle common debugging scenarios with no changes to the main dataflow job, and have detailed granularity at different levels with low overhead to the main dataflow job. By meeting these desiderata, debugging big data processing systems can be made easier, faster, and more efficient, ultimately enabling organizations to achieve their big data processing goals with greater confidence.
In our upcoming post, we're excited to introduce you to TagSniff, a revolutionary data debugging model that simplifies the task of debugging big data tasks. We'll dive into the technical details of how TagSniff works and share how it can be applied to various big data processing scenarios. Additionally, we'll discuss our ongoing efforts to integrate the TagSniff model into Apache Wayang (Incubating), a popular big data processing framework, making it available to a wider range of users. Stay tuned to learn more about how TagSniff can help streamline your big data processing workflow and improve the reliability and efficiency of your data.