Scalytics | Why Developers Need Big Data Debugging

February 17, 2023

As the field of big data processing continues to grow, so too do the challenges of debugging complex, distributed data processing systems. Traditional debugging approaches, designed for "small data" applications, are often inadequate for the scale and complexity of big data systems. As a result, programmers of big data applications face a range of unique challenges that require new approaches to debugging.

The Changing Face of Debugging

Frameworks like Apache Spark, Apache Flink, and Apache Wayang (Incubating) have made big data processing much easier. However, big data debugging is still in its infancy. Suppose an analytic task on a terabyte of data fails to produce the expected results. There are two common, but ineffective, approaches to debugging such a task:

Traditional debugging tools: The first approach brings the tools developed for "small data" debugging to big data. One could attach a debugger to a remote Spark process and use the traditional mechanisms, such as setting watchpoints, pausing the Spark runtime, and stepping through the code line by line. This approach is expensive because it pauses the entire Spark runtime. Furthermore, given the sheer size of the data, stepping through the code and watching the intermediate results for each tuple is extremely time-consuming.
Local machine evaluation: The second approach evaluates the task on a local machine over a sample of the input dataset. It relies on the observation that erroneous outputs are typically triggered by a small fraction of the data. If the sample does not trigger the issue, one tries a larger sample, and so on. Eventually, the sample becomes too large to hold on a single machine or to inspect with traditional debugging techniques, so this approach breaks down as well.
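The sampling loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `parse_amount`, `failing_records`, `debug_by_sampling`, the record layout, and the sample fractions are all hypothetical and do not come from Spark or any other framework.

```python
import random

def parse_amount(record):
    # Hypothetical per-record step with a data-dependent bug:
    # it cannot parse amounts written with a thousands separator.
    return float(record["amount"])

def failing_records(records):
    """Run the step on every record and collect the ones that raise."""
    failures = []
    for record in records:
        try:
            parse_amount(record)
        except ValueError:
            failures.append(record)
    return failures

def debug_by_sampling(dataset, fractions=(0.001, 0.01, 0.1, 1.0), seed=0):
    """Try growing random samples until one reproduces the failure."""
    rng = random.Random(seed)
    for fraction in fractions:
        size = max(1, int(len(dataset) * fraction))
        sample = rng.sample(dataset, size)
        failures = failing_records(sample)
        if failures:
            return fraction, failures
    return None, []

# 100,000 records, exactly one of which is malformed (assumed toy data).
dataset = [{"id": i, "amount": "10.50"} for i in range(100_000)]
dataset[73_421]["amount"] = "1,050.00"

fraction, failures = debug_by_sampling(dataset)
print(f"reproduced with {fraction:.1%} of the data: {failures}")
```

With a single bad record among 100,000, small samples are unlikely to contain it, so the loop tends to escalate toward the full dataset. That is the scaling problem described above: at a terabyte, the final step no longer fits on one machine.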

We make the following three observations:

Most bugs are caused by the interplay between code and data. Traditional debugging tools are designed for code debugging, not data debugging.
Traditional debugging tools are not appropriate for distributed debugging. Typical data processing jobs involve hundreds of tasks that run on dozens of workers and generate a huge amount of intermediate data.
Recent attempts at big data debugging are ad hoc, task-specific, and inflexible. There is a need for an abstraction that addresses distributed code-and-data debugging while hiding the internal complexity of the system.

Data Debugging Challenges

One of the biggest challenges of big data debugging is the interplay between code and data. Most bugs are caused by complex interactions between code and the massive amounts of data being processed, yet traditional debugging tools are designed for code debugging and not data debugging. Furthermore, big data systems are inherently distributed, involving hundreds of tasks that are run on dozens of workers, generating a huge amount of intermediate data. Traditional debugging tools are not appropriate for distributed debugging, and recent attempts at big data debugging have been ad hoc, task-specific, and inflexible.

Data Debugging Modes

As already discussed, debugging big data jobs is challenging, especially on a large-scale system. Two major modes of big data debugging are in common use: online and post-hoc. In this post, we discuss each mode in detail and highlight its strengths and limitations.

Online mode. Online debugging happens while the main dataflow job is still running. Users can inspect intermediate results and debug by trial and error. Providing this kind of interactivity is challenging because popular data processing systems operate in batch mode. Pausing the dataflow job could pause the computation done by thousands of workers, which reduces throughput and wastes processing resources. Ideally, the online mode should:

Allow a user to inspect intermediate results with or without pausing the dataflow execution.
Provide a set of primitives so that a user can select intermediate data relevant for debugging programmatically.

Very few systems provide support for online big data debugging.
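To make the two requirements above concrete, here is a minimal single-process sketch that uses Python generators as a stand-in for a dataflow. The `Probe` class and its `tap` method are hypothetical names invented for this illustration; they are not an API of Spark, Flink, or Wayang. The probe copies the tuples selected by a user-supplied predicate into a bounded side buffer and forwards every tuple unchanged, so the job is never paused.

```python
from collections import deque

class Probe:
    """Hypothetical debugging primitive: copies selected intermediate
    tuples into a bounded side buffer while the dataflow keeps running."""

    def __init__(self, predicate, capacity=100):
        self.predicate = predicate
        self.captured = deque(maxlen=capacity)  # keeps only the newest matches

    def tap(self, stream):
        for item in stream:
            if self.predicate(item):
                self.captured.append(item)
            yield item  # every tuple is forwarded, so the job output is unchanged

def parse(lines):
    for line in lines:
        user, value = line.split(",")
        yield user, int(value)

def double(pairs):
    for user, value in pairs:
        yield user, value * 2

lines = ["alice,3", "bob,-7", "carol,12", "dave,-1"]

# Select only the intermediate tuples relevant for debugging: negative values.
probe = Probe(lambda pair: pair[1] < 0)

# The probe sits between two operators of the (toy) dataflow.
results = list(double(probe.tap(parse(lines))))

print(results)
print(list(probe.captured))
```

A real distributed system would also have to ship the captured tuples from many workers to the user and bound the overhead of evaluating the predicate, which this sketch does not attempt.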

Post-hoc mode. This is the most common mode for big data debugging. Users instrument the main dataflow job to dump information into a log. One can then write another job (e.g., in Spark) to analyze the log and identify the issue. While common, relying on log files is often not sufficient, because the logs do not provide a logical view of the job, such as which input records produced a given intermediate result or the eventual output (i.e., lineage). This information is often invaluable for effective debugging. Ideally, the post-hoc mode should:

Give the user a logical view of the job without any effort, and
Provide an easy way to express common post-hoc debugging scenarios.

As with online debugging, few systems provide extensive support for post-hoc debugging. Most of them support specific scenarios, such as lineage or task replay, and cannot be easily generalized to others.
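The following sketch shows what lineage in a post-hoc log might look like, assuming a toy job that drops records with missing values and then sums the rest per key. The function names, the log schema, and the data are all hypothetical; the point is only that once each log entry records the ids of the inputs it consumed, a second pass over the log can answer which input records produced a given output.

```python
import json

def run_instrumented_job(records, log):
    """Hypothetical job: drop records without a value, then sum per key.
    Each step appends a JSON log entry naming the input record ids it used."""
    totals = {}
    lineage = {}
    for record_id, (key, value) in enumerate(records):
        if value is None:
            log.append(json.dumps({"op": "filter", "dropped": record_id}))
            continue
        totals[key] = totals.get(key, 0) + value
        lineage.setdefault(key, []).append(record_id)
    for key, total in totals.items():
        entry = {"op": "sum", "key": key, "total": total, "inputs": lineage[key]}
        log.append(json.dumps(entry))
    return totals

def inputs_of(log, key):
    """Post-hoc analysis: which input record ids produced the output for `key`?"""
    for line in log:
        entry = json.loads(line)
        if entry["op"] == "sum" and entry["key"] == key:
            return entry["inputs"]
    return []

records = [("a", 1), ("b", 5), ("a", None), ("a", -40), ("b", 2)]
log = []
totals = run_instrumented_job(records, log)

# The total for key "a" looks suspicious, so trace it back to its inputs.
suspects = [records[i] for i in inputs_of(log, "a")]
print(totals, suspects)
```

Note that the lineage is available here only because the job was instrumented by hand, which is the kind of per-task effort the second requirement above aims to remove.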

We Need Native Big Data Debuggers

In conclusion, debugging big data processing systems is a complex and challenging task that requires careful consideration of the most popular debugging tasks and their corresponding requirements. As shown in Table 1, there are seven major categories of common debugging tasks, and very few systems can support all of them. The development cost of implementing task-specific variants can be significant, highlighting the need for primitives that are concise, flexible, and provide support for both monitoring and debugging. Moreover, for a data debugging system to be effective, it must provide holistic support for these debugging primitives, handle common debugging scenarios with no changes to the main dataflow job, and have detailed granularity at different levels with low overhead to the main dataflow job. By meeting these desiderata, debugging big data processing systems can be made easier, faster, and more efficient, ultimately enabling organizations to achieve their big data processing goals with greater confidence.

In our upcoming post, we're excited to introduce you to TagSniff, a revolutionary data debugging model that simplifies the task of debugging big data tasks. We'll dive into the technical details of how TagSniff works and share how it can be applied to various big data processing scenarios. Additionally, we'll discuss our ongoing efforts to integrate the TagSniff model into Apache Wayang (Incubating), a popular big data processing framework, making it available to a wider range of users. Stay tuned to learn more about how TagSniff can help streamline your big data processing workflow and improve the reliability and efficiency of your data.
