Scalytics | Debug Decentral Big Data Apps with TagSniff

February 24, 2023

In our previous post, we introduced the TagSniff model, a dataflow instrumentation approach that offers a set of primitives for online and post-hoc debugging tasks. Debugging means locating and fixing errors in computer programs. TagSniff supports this with two primitives that operate on the debug tuple, which is a tuple together with its annotations and metadata: tag attaches tags to a debug tuple, and sniff identifies the tuples that need further analysis based on their metadata or values. In this post, we discuss how to use TagSniff in the two common data debugging modes: online and post-hoc debugging.
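
To make the two primitives concrete, here is a minimal Python sketch. It is not the TagSniff API: the `DebugTuple` class, the `instrument` helper, and the `suspicious` tag are hypothetical names chosen for illustration, and the sketch only assumes that tag annotates a tuple and sniff decides whether the tuple needs attention.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: the original value plus its tags."""
    value: Any
    tags: Dict[str, Any] = field(default_factory=dict)

    def add_tag(self, name: str, payload: Any = None) -> None:
        self.tags[name] = payload

    def has_tag(self, name: str) -> bool:
        return name in self.tags


def instrument(
    values: Iterable[Any],
    tag: Callable[[DebugTuple], None],
    sniff: Callable[[DebugTuple], bool],
) -> List[DebugTuple]:
    """Wrap each value, apply tag, and return the tuples that sniff selects."""
    selected = []
    for value in values:
        t = DebugTuple(value)
        tag(t)
        if sniff(t):
            selected.append(t)
    return selected


def tag_negative(t: DebugTuple) -> None:
    # Annotate tuples whose value looks wrong (here: negative numbers).
    if t.value < 0:
        t.add_tag("suspicious")


def sniff_suspicious(t: DebugTuple) -> bool:
    # Select the tuples that need further analysis.
    return t.has_tag("suspicious")


flagged = instrument([3, -1, 7, -5], tag_negative, sniff_suspicious)
print([t.value for t in flagged])  # prints [-1, -5]
```

The split matters: tag only records information on the tuple, while sniff only reads it, so what the system does with a selected tuple (pause, display, drop) can be decided separately.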

Online Data Debugging

Debugging is an essential part of any software development process, and it becomes even more critical when it comes to big data applications. Debugging big data applications presents new challenges as it requires a combination of code and data interactivity to resolve any errors. Online debugging is a technique that allows developers to debug big data applications while the job is still running.

Online debugging lets developers add breakpoints for data inspection, be notified when a crash is triggered, and be alerted when certain conditions on the data are met. Because all of these tasks depend on interactivity between data and code, new interactivity functionalities are required to make online debugging of big data applications efficient.

One can use TagSniff for online debugging of big data applications. It allows users to inspect the data at various stages of the dataflow, identify the cause of any crashes, and be alerted when certain conditions are met. The power of TagSniff lies in how it reacts to specific tag and sniff calls to support online debugging scenarios.

Data Breakpoints

Data breakpoints allow developers to inspect the data at various stages of the dataflow. Suppose a breakpoint pauses execution on tuples containing a null value, which carry the tag pause. TagSniff provides two interactivity actions for continuing from such a breakpoint, Next tuple and Next operator. With Next tuple, once the user has finished inspecting a paused tuple, TagSniff removes the pause tag and sends the tuple to the next operator. With Next operator, the user can check how the tuple is transformed by the downstream operators: TagSniff propagates the pause tag along with the tuple, so that the sniff function in the downstream operator pauses the execution again.
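
The difference between the two actions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TagSniff code: the `run` driver, the `action` strings, and the two operators are hypothetical, and a "pause" is simulated by recording an inspection event instead of blocking.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Set, Tuple


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: a value plus a set of tags."""
    value: Any
    tags: Set[str] = field(default_factory=set)


def tag(t: DebugTuple) -> None:
    # Breakpoint condition: the row contains a null (None) field.
    if any(f is None for f in t.value):
        t.tags.add("pause")


def sniff(t: DebugTuple) -> bool:
    return "pause" in t.tags


def run(rows, operators: List[Callable], action: str) -> List[Tuple[int, Any]]:
    """Push rows through the operators and record where execution would pause.

    action is "next_tuple" (drop the pause tag after the first inspection)
    or "next_operator" (keep the tag so downstream operators pause too).
    """
    events = []
    for row in rows:
        t = DebugTuple(row)
        tag(t)  # tag is applied once, at the source
        for op_index, op in enumerate(operators):
            if sniff(t):
                events.append((op_index, t.value))  # a real system would pause here
                if action == "next_tuple":
                    t.tags.discard("pause")
            t.value = op(t.value)
    return events


def fill_nulls(row):
    return tuple(0 if f is None else f for f in row)


def total(row):
    return sum(row)


rows = [(1, 2), (3, None)]
print(run(rows, [fill_nulls, total], "next_tuple"))     # [(0, (3, None))]
print(run(rows, [fill_nulls, total], "next_operator"))  # [(0, (3, None)), (1, (3, 0))]
```

With Next operator the second event shows the same tuple after `fill_nulls`, which is exactly the "how was it transformed" view described above.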

Crash culprits

Crash culprits are a major concern in big data applications. A crash culprit is a tuple that causes the system to crash. To resolve the error, one must identify both the culprit tuple and the operator where the exception occurred. TagSniff provides a Catch crash culprits functionality that catches any runtime exception and annotates the tuple with the tag "crash" together with the exception trace, the operator id, and the node IP address. We illustrate these two instances below:

For crash culprits, the tag function adds a crash tag to the tuple that triggered the runtime exception. The content of the tag is the concatenation of the exception trace TRC, the operator id OID, and the node IP address. The sniff function then checks whether a tuple carries the crash tag. If it does, the tuple is returned for further inspection.
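
The crash-culprit tag and sniff described above can be sketched as follows. This is a plain-Python illustration, not the TagSniff implementation: `run_operator`, the `|` separator, the operator id `parse-int`, and the IP address `10.0.0.1` are all hypothetical placeholders.

```python
import traceback
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: a value plus named tags."""
    value: Any
    tags: Dict[str, str] = field(default_factory=dict)


def sniff(t: DebugTuple) -> bool:
    return "crash" in t.tags


def run_operator(
    op: Callable[[Any], Any], op_id: str, node_ip: str, tuples: List[DebugTuple]
) -> Tuple[List[DebugTuple], List[DebugTuple]]:
    """Apply op to every tuple; tag and collect the tuples that make it fail."""
    outputs, culprits = [], []
    for t in tuples:
        try:
            outputs.append(DebugTuple(op(t.value), dict(t.tags)))
        except Exception:
            trc = traceback.format_exc()  # exception trace (TRC)
            # tag: concatenate trace, operator id (OID), and node IP
            t.tags["crash"] = f"{trc}|{op_id}|{node_ip}"
        if sniff(t):
            culprits.append(t)
    return outputs, culprits


inputs = [DebugTuple(s) for s in ["1", "x", "3"]]
outputs, culprits = run_operator(int, "parse-int", "10.0.0.1", inputs)
print([t.value for t in outputs])   # [1, 3]
print([t.value for t in culprits])  # ['x']
```

Catching the exception per tuple keeps the job alive for the remaining tuples, while the crash tag preserves where and why the failure happened.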

Alert

The Alert functionality notifies the user when a tuple satisfies some condition of interest, for example when its processing latency points to a bottleneck. Conditions can be placed on a single tuple or on a set of tuples. Suppose the user wants to be alerted about slow invocations of the ReduceByKey operator. The TagSniff system calls the tag primitive before and after the tuple is processed by ReduceByKey. The sniff primitive retrieves the timestamp metadata from the debug tuple, computes the latency of the ReduceByKey invocation, and checks whether it is above a specified threshold. The following code snippet illustrates these primitives:

Using TagSniff to detect latency between data primitives

The tag function adds a timestamp tag to the tuple, containing the current system time in milliseconds. Because tag is called both before and after the operator, each tuple ends up with two timestamps. The sniff function computes the difference between these two timestamps. If the difference is greater than the predefined threshold, the tuple is returned so that the user can be alerted.
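
A minimal sketch of this latency alert, assuming a generic operator in place of ReduceByKey. The names `run_with_alert`, `THRESHOLD_MS`, and the injectable `clock` parameter are hypothetical; the fake clock is used only to make the example reproducible.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

THRESHOLD_MS = 100.0  # hypothetical alert threshold


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: a value plus timestamp tags."""
    value: Any
    tags: Dict[str, List[float]] = field(default_factory=dict)


def system_clock_ms() -> float:
    return time.time() * 1000.0


def tag(t: DebugTuple, clock: Callable[[], float] = system_clock_ms) -> None:
    # Append the current time; called once before and once after the operator.
    t.tags.setdefault("timestamp", []).append(clock())


def sniff(t: DebugTuple, threshold_ms: float = THRESHOLD_MS) -> bool:
    stamps = t.tags.get("timestamp", [])
    return len(stamps) == 2 and stamps[1] - stamps[0] > threshold_ms


def run_with_alert(
    op: Callable[[Any], Any], t: DebugTuple, clock: Callable[[], float] = system_clock_ms
) -> Tuple[Any, bool]:
    """Run op on the tuple and report whether its latency exceeds the threshold."""
    tag(t, clock)
    result = op(t.value)
    tag(t, clock)
    return result, sniff(t)


# Deterministic clock for the example: 5 ms for the first call, 240 ms for the second.
fake_times = iter([0.0, 5.0, 10.0, 250.0])
fake_clock = lambda: next(fake_times)

fast = run_with_alert(len, DebugTuple("ab"), fake_clock)
slow = run_with_alert(len, DebugTuple("abc"), fake_clock)
print(fast, slow)  # (2, False) (3, True)
```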

Post-Hoc Data Debugging

Post-hoc debugging helps identify underlying issues in the dataflow execution. It takes place on the execution logs once the main dataflow job finishes. Plain execution logs provide only a simplistic view in which the input, intermediate, and output tuples are decoupled. Users can therefore leverage the TagSniff primitives to produce much richer execution logs with a logical view.

There are several post-hoc tasks that users can perform with the help of TagSniff, such as forward and backward tracing, selective replay, and lineage information. Users can analyze these logs to identify the underlying issue. Obtaining lineage information or replaying a part of the dataflow execution for a subset of tuples might require quite some coding expertise. Therefore, it is essential to introduce a set of convenience methods that prevent users from writing many lines of code.

Forward and Backward Tracing

Forward tracing allows users to identify which output tuples were generated from a given input tuple. More generally, it allows users to understand how a given tuple is transformed by the various operators in the dataflow. Conversely, backward tracing allows users to identify the input tuple(s) that generated a given output tuple, which can be seen as a special case of lineage. Suppose a user wants to trace an input tuple throughout the entire dataflow if it contains an empty word. Using the logs, the user can either run an ad-hoc dataflow or run the original dataflow properly instrumented with TagSniff. We argue that the latter is much simpler. The tag primitive annotates all tuples containing an empty word as trace, and all others as skip. Note that a TagSniff system would apply this tag function at the source operator, followed by a sniff function. This sniff function returns true for all tuples because each of them requires the system to act: either display the tuple to the user (trace) or remove the tuple from the dataflow (skip). The code snippet below illustrates the above tag and sniff:

The tag function tags each tuple based on the empty-word condition, which the snippet expresses as a check for a space character. If the tuple matches, tag adds the tag "trace"; otherwise, it adds the tag "skip". The sniff function returns true for any tuple that has a tag, which here is every tuple, so the system acts on all of them: it shows the tuples tagged "trace" to the user and removes the tuples tagged "skip" from the dataflow.
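
The trace/skip logic can be sketched in plain Python as follows. Several things here are assumptions: "contains an empty word" is interpreted as "splitting the line on single spaces yields an empty string", and `trace_forward` with its flatMap and map stages is a hypothetical stand-in for the instrumented dataflow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: an input line plus a set of tags."""
    value: str
    tags: Set[str] = field(default_factory=set)


def tag(t: DebugTuple) -> None:
    # Assumption: a line "contains an empty word" if splitting on single
    # spaces produces an empty string (e.g. two consecutive spaces).
    has_empty_word = "" in t.value.split(" ")
    t.tags.add("trace" if has_empty_word else "skip")


def sniff(t: DebugTuple) -> bool:
    # Every tuple carries a tag, so the system must act on every tuple.
    return bool(t.tags)


def trace_forward(lines: List[str]) -> List[Dict[str, object]]:
    """Follow traced lines through a flatMap and a map stage; drop skipped ones."""
    report = []
    for line in lines:
        t = DebugTuple(line)
        tag(t)  # applied at the source operator
        if sniff(t) and "skip" in t.tags:
            continue  # removed from the dataflow
        words = t.value.split(" ")       # flatMap stage
        pairs = [(w, 1) for w in words]  # map stage
        report.append({"input": line, "flatMap": words, "map": pairs})
    return report


line_with_gap = " ".join(["big", "", "data"])  # two consecutive spaces
report = trace_forward(["hello world", line_with_gap])
for entry in report:
    print(entry)
```

Only the line with the empty word survives, and the report shows what each downstream stage produced from it.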

Selective Replay

Selective replay allows a user to replay portions of the dataflow graph. Selective replay has several applications, such as understanding how a subset of the dataset is affected by the dataflow, performing interactive queries on intermediate datasets for debugging, and re-executing part of the workflow with modified inputs. Suppose that a user is interested in selectively replaying the execution of the map and reduceByKey operators in the following code snippet that computes the Top-100 frequent words:

Wordcount for top 100 words in a given document
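
For readers without access to the original snippet, the following is a plain-Python sketch of a Top-100 word count with the same operator sequence (flatMap, map, reduceByKey, top). It is an illustrative stand-in, not the original dataflow code, and it does not reproduce the original snippet's line numbering; `top_words` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(lines: Iterable[str], n: int = 100) -> List[Tuple[str, int]]:
    """Return the n most frequent words with their counts."""
    # flatMap: split every line into words
    words = [w for line in lines for w in line.split(" ")]
    # map: emit a (word, 1) pair per word
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts: Counter = Counter()
    for word, one in pairs:
        counts[word] += one
    # top: keep the n most frequent words
    return counts.most_common(n)


result = top_words(["a b a", "b a c"], n=2)
print(result)  # [('a', 3), ('b', 2)]
```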

To achieve the above goal, the user can load the intermediate output of line 1 and run the rest of the original data flow instrumented with TagSniff. A TagSniff system would add a tag and sniff primitive after the reduceByKey operator, where the tag primitive adds the skip tag to all tuples output from that operator, and sniff returns true for all tuples. The TagSniff system would then be responsible for removing all tuples tagged as skip from the main dataflow. The code snippet below illustrates these tag and sniff functions:
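
The replay described above can be sketched as follows, assuming the intermediate output is already available as a list of words. The `replay_map_and_reduce` function and the `DebugTuple` class are hypothetical; the point is only that everything leaving reduceByKey is tagged skip, so nothing flows into the rest of the dataflow.

```python
from dataclasses import dataclass, field
from typing import Any, List, Set, Tuple


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: a value plus a set of tags."""
    value: Any
    tags: Set[str] = field(default_factory=set)


def tag(t: DebugTuple) -> None:
    # Every tuple leaving reduceByKey is marked for removal.
    t.tags.add("skip")


def sniff(t: DebugTuple) -> bool:
    # The system has to act on every tuple.
    return True


def replay_map_and_reduce(
    intermediate_words: List[str],
) -> Tuple[List[Tuple[str, int]], List[DebugTuple]]:
    """Replay map and reduceByKey on loaded intermediate data.

    Returns the replayed output (for inspection) and the tuples that
    would continue downstream (none, because all are tagged skip).
    """
    pairs = [(w, 1) for w in intermediate_words]  # map
    counts = {}
    for word, one in pairs:                       # reduceByKey
        counts[word] = counts.get(word, 0) + one

    replayed, downstream = [], []
    for key_value in counts.items():
        t = DebugTuple(key_value)
        tag(t)
        replayed.append(t.value)
        if sniff(t) and "skip" in t.tags:
            continue  # removed from the main dataflow
        downstream.append(t)
    return replayed, downstream


replayed, downstream = replay_map_and_reduce(["a", "b", "a"])
print(replayed)    # [('a', 2), ('b', 1)]
print(downstream)  # []
```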

Post-hoc Assertions

Post-hoc assertions verify whether a given condition is satisfied on input, intermediate, or output datasets. They can be used to identify errors that occurred during data processing. For example, in the Top-100 frequent words example above, one can use a post-hoc assertion to verify that all strings passed as input to the map function have a length of at least 1. With TagSniff, the tag primitive applies the display tag to tuples that fail to satisfy this condition, and sniff returns true for all tuples that carry the display tag, so that the system shows them to the user for further inspection.

Detect errors during a data processing operation with TagSniff

This code applies a post-hoc assertion to check that every string passed as input to the map function has a length of at least 1, the condition stated above. It adds the "display" tag to any string that fails this check, and then uses sniff to return true for any tuple that carries the "display" tag, so that the system can show it to the user for further inspection.
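
The assertion described above can be sketched in Python as follows. The minimum length is exposed as a parameter (`MIN_LENGTH`, a hypothetical name) so the same check can be reused with a different bound, and `check_map_input` is an illustrative driver rather than part of TagSniff.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

MIN_LENGTH = 1  # assumed bound: strings must have at least this many characters


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: a string plus a set of tags."""
    value: str
    tags: Set[str] = field(default_factory=set)


def tag(t: DebugTuple, min_length: int = MIN_LENGTH) -> None:
    # Mark tuples that violate the assertion.
    if len(t.value) < min_length:
        t.tags.add("display")


def sniff(t: DebugTuple) -> bool:
    return "display" in t.tags


def check_map_input(words: List[str]) -> List[Tuple[int, str]]:
    """Return (position, value) for every string that fails the assertion."""
    shown = []
    for position, word in enumerate(words):
        t = DebugTuple(word)
        tag(t)
        if sniff(t):
            shown.append((position, t.value))  # shown to the user
    return shown


violations = check_map_input(["big", "", "data", ""])
print(violations)  # [(1, ''), (3, '')]
```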

Performance Profiling

Performance profiling is another technique, used to identify bottlenecks in large-scale data processing systems. It involves analyzing execution logs to understand the dataflow footprint in terms of different performance metrics, such as latency and throughput at the tuple or operator level. Straggler tuples, a pernicious problem in big data analytics, can cause significant performance degradation. Most data processing systems, such as Spark, only provide coarse monitoring support at the job and worker level. However, it is often important to know how long the processing of each tuple took so that bottlenecks can be identified. If the logged tuples contain proper timestamps, performance profiling can be done by running an ad-hoc dataflow on the logs. If not, one could perform selective replay with tag and sniff: tag adds a timestamp to the tuple before and after the operator execution, and sniff checks whether the latency of the tuple processing is above some threshold. Alternatively, a user could use the assert convenience method to identify straggler tuples. The code snippet below illustrates how to use this method for profiling:

This code applies a post-hoc assertion to the input dataset to check if the processing time of each tuple is above a certain threshold, in order to identify potential straggler tuples. It reads a dataset from a log file using the Reader class, then applies the assert method on the dataset to filter out any tuples that don't meet the specified condition. In this case, the condition is that the time difference between the timestamps stored in the "timestamp" tag of each tuple is greater than a predefined threshold. Finally, it collects and prints out the filtered dataset.
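
A rough Python equivalent of this log-based profiling is sketched below. The log format (one JSON object per line with a `timestamp` tag holding start and end times in milliseconds) is an assumption, as are the helper names `read_log` and `assert_tuples`; they stand in for the Reader class and the assert convenience method, whose exact signatures are not shown here.

```python
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List

THRESHOLD_MS = 100  # hypothetical straggler threshold


def read_log(stream: Iterable[str]) -> Iterator[Dict]:
    """Parse one logged debug tuple per non-empty line (assumed JSON format)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def assert_tuples(records: Iterable[Dict], condition: Callable[[Dict], bool]) -> List[Dict]:
    """Keep only the logged tuples that satisfy the condition."""
    return [r for r in records if condition(r)]


def is_straggler(record: Dict) -> bool:
    start, end = record["tags"]["timestamp"]
    return end - start > THRESHOLD_MS


# In-memory stand-in for a log file.
LOG = io.StringIO(
    '{"value": "a", "tags": {"timestamp": [0, 12]}}\n'
    '{"value": "b", "tags": {"timestamp": [5, 905]}}\n'
)

stragglers = assert_tuples(read_log(LOG), is_straggler)
print([r["value"] for r in stragglers])  # ['b']
```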

Summary

In conclusion, debugging big data applications requires new interactivity functionalities that traditional debugging methods may not provide. Online debugging with TagSniff lets developers debug a big data application while the job is still running: they can inspect data at various stages of the dataflow, identify crash culprits, and be alerted when certain conditions on the data are met. Post-hoc debugging, in turn, helps identify the underlying issues in the dataflow execution after the job has finished. With TagSniff, users can perform post-hoc tasks such as forward and backward tracing, selective replay, post-hoc assertions, and performance profiling. Overall, TagSniff provides a framework for applying these techniques to distributed dataflow systems, helping developers and data scientists gain insights into system performance and identify issues that occurred during data processing.
