Debugging is an important part of software engineering. Put simply, it is the process of finding and fixing issues in computer programs. There are many ways to debug, and most of them can be long and painful, as we all know. It is something we as software developers have to do, even if we rarely love it.
In a previous blog post, we discussed the difficulty of debugging big data applications, the problems that arise during the debugging process, and the multi-layered processes created by using federated learning on multiple, independent data stores. Today we're introducing "TagSniff," a new debugging technique that adds simple instrumentation primitives for online and post-hoc debugging.
TagSniff is part of Blossom Core and feeds into our upcoming AI advisor, a new generative AI to help debug AI pipelines across federated datastores. The Blossom advisory service will detect bugs or invalid data modifiers, trace them, and report the findings back to the data engineers' console. During the process, the advisor learns and can find and fix bugs automatically. The data engineer can choose to apply suggestions manually or to integrate the advisor into a CI/CD pipeline. The advisor will be able to open pull requests and suggest code changes for review. We plan to release a first code preview in Q3 2023, but we can't and won't commit to this timeline at this early stage of development.
We introduce TagSniff in more detail in this blog post: what it is, its components, and how to use it to debug your code.
The idea behind "TagSniff"
The idea behind TagSniff is a data flow instrumentation approach consisting of two primitives, tag and sniff, that operate on debug tuples. A debug tuple is a data structure consisting of multiple parts that flows between the data operators whenever debugging is enabled. The tag primitive adds tags to a tuple, while the sniff primitive identifies tuples requiring debugging or further analysis based on their metadata or values. The unique characteristic of these primitives is that users can easily add custom debugging functionality via user-defined functions (UDFs). A TagSniff system refers to any system that implements this abstract debugging model.
The debugging item store
The debug tuple is the tuple on which the TagSniff primitives operate. A debug tuple is composed of the original tuple prefixed with annotations and/or metadata, <|tag1|tag2|..., <tuple>>. Annotations describe how users expect the system to react, while metadata adds extra information to the tuples, such as an identifier. The table below illustrates an example set of annotations.
Tags are inserted by either users or the debugging system and mainly stem from dataflow instrumentation. The users can manipulate these tags to support sophisticated debugging scenarios, such as lineage. To enable this tag manipulation, TagSniff provides the following methods on the debug tuple:
- add_tag (tag: String)
- get_tag (tag: String)
- has_tag (tag: String)
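To make these methods concrete, here is a minimal Python sketch of a debug tuple. The DebugTuple class, its field names, and the return values are our assumptions for illustration: the list above only names the methods, so we assume has_tag returns a boolean and get_tag returns the stored tag or None.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class DebugTuple:
    """Hypothetical debug tuple: the original tuple plus its tags."""

    value: Tuple[Any, ...]
    tags: List[str] = field(default_factory=list)

    def add_tag(self, tag: str) -> None:
        # Assumption: adding the same tag twice has no further effect.
        if tag not in self.tags:
            self.tags.append(tag)

    def has_tag(self, tag: str) -> bool:
        return tag in self.tags

    def get_tag(self, tag: str) -> Optional[str]:
        # Assumption: return the stored tag, or None if it is absent.
        return tag if tag in self.tags else None


# Usage: wrap an original tuple and annotate it.
t = DebugTuple(("spark", None))
t.add_tag("pause")
```

Keeping the tags in a list preserves the order in which they were added, which can be handy when tags are later used to reconstruct how a tuple moved through the dataflow.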
The "tag and sniff" primitives
The TagSniff model provides two primitives, tag and sniff, to instrument the debug tuple. The tag primitive is used for adding tags to a tuple. The input is a UDF that receives a tuple and outputs a new tuple with any new tags users would like to append. The sniff primitive is used for identifying tuples requiring debugging or further analysis based on either their metadata or values. The input is a UDF that receives a tuple and outputs true or false depending on whether the user wants to analyze this tuple or not. Let's take a look at two specific debugging tasks that can be implemented using "TagSniff" without requiring a lot of boilerplate code.
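As a rough illustration of how the two primitives could wrap user-defined functions, consider the following Python sketch. It is not the Blossom or Wayang API: the tag and sniff functions, the (tags, tuple) representation, and the "suspicious" tag are all hypothetical and chosen only to show the shape of the UDFs.

```python
from typing import Any, Callable, FrozenSet, Iterable, Iterator, List, Tuple

# Hypothetical debug tuple representation: (tags, original tuple).
DebugTuple = Tuple[FrozenSet[str], Tuple[Any, ...]]


def tag(udf: Callable[[DebugTuple], DebugTuple],
        stream: Iterable[DebugTuple]) -> Iterator[DebugTuple]:
    """Apply the tagging UDF to every debug tuple flowing through."""
    for t in stream:
        yield udf(t)


def sniff(udf: Callable[[DebugTuple], bool],
          stream: Iterable[DebugTuple],
          sniffed: List[DebugTuple]) -> Iterator[DebugTuple]:
    """Pass every tuple on unchanged and collect those the UDF flags."""
    for t in stream:
        if udf(t):
            sniffed.append(t)
        yield t


def tag_negative_count(t: DebugTuple) -> DebugTuple:
    # Tagging UDF: mark (word, count) tuples whose count is negative.
    tags, value = t
    return (tags | {"suspicious"}, value) if value[1] < 0 else t


def is_suspicious(t: DebugTuple) -> bool:
    # Sniffing UDF: true if the tuple should be analyzed further.
    return "suspicious" in t[0]


source = [(frozenset(), ("spark", 3)), (frozenset(), ("flink", -1))]
sniffed: List[DebugTuple] = []
output = list(sniff(is_suspicious, tag(tag_negative_count, source), sniffed))
```

In this sketch the dataflow itself is untouched: every tuple still reaches the output, and only the flagged ones are set aside in sniffed for inspection.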
Example 1: Data Breakpoint
Suppose you want to add a data breakpoint in a Spark program that retrieves the top-100 most frequent words. You want to pause the execution of the program whenever it encounters a tuple with a null value, so that you can further inspect it. Here's how you can achieve this using the tag and sniff primitives:
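A minimal sketch of this breakpoint in plain Python, without Spark, could look as follows. The DebugTuple class, the run driver, and the on_pause callback are hypothetical stand-ins; a real TagSniff system would suspend the job instead of calling a callback.

```python
from typing import Any, Callable, Iterable, Iterator, List, Set, Tuple


class DebugTuple:
    """Hypothetical debug tuple: the original tuple plus a set of tags."""

    def __init__(self, value: Tuple[Any, ...]) -> None:
        self.value = value
        self.tags: Set[str] = set()

    def add_tag(self, tag: str) -> None:
        self.tags.add(tag)

    def has_tag(self, tag: str) -> bool:
        return tag in self.tags


def tag_udf(t: DebugTuple) -> DebugTuple:
    # Tag: mark any tuple that contains a null (None) value.
    if any(v is None for v in t.value):
        t.add_tag("pause")
    return t


def sniff_udf(t: DebugTuple) -> bool:
    # Sniff: true if execution should pause on this tuple.
    return t.has_tag("pause")


def run(records: Iterable[Tuple[Any, ...]],
        on_pause: Callable[[DebugTuple], None]) -> Iterator[Tuple[Any, ...]]:
    """Stand-in for the dataflow: instrument each record as it flows by."""
    for record in records:
        t = tag_udf(DebugTuple(record))
        if sniff_udf(t):
            on_pause(t)  # a real system would suspend execution here
        yield t.value


paused: List[DebugTuple] = []
results = list(run([("spark", 3), (None, 1), ("wayang", 2)], paused.append))
```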
In the above code, the tag primitive adds a "pause" tag to any tuple that contains a null value, while the sniff primitive checks if a tuple has the "pause" tag and returns true if it does, indicating that the execution of the program should be paused at that point.
Example 2: Log
Suppose you want to log any tuple that contains a null value so that you can use it for tracing later on. You need to generate a unique identifier for each tuple and add it to the tuple's metadata. Here's how you can achieve this using the tag and sniff primitives:
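A minimal Python sketch of this logging task might look like the following. The DebugTuple class, the "id-" prefix for the identifier tag, and the in-memory log list are assumptions made for illustration; the identifier comes from Python's standard uuid module.

```python
import uuid
from typing import Any, List, Tuple


class DebugTuple:
    """Hypothetical debug tuple: the original tuple plus a list of tags."""

    def __init__(self, value: Tuple[Any, ...]) -> None:
        self.value = value
        self.tags: List[str] = []

    def add_tag(self, tag: str) -> None:
        self.tags.append(tag)

    def has_tag(self, tag: str) -> bool:
        return tag in self.tags


def tag_udf(t: DebugTuple) -> DebugTuple:
    # Tag: give tuples with a null (None) value a unique id and a "log" tag.
    if any(v is None for v in t.value):
        t.add_tag(f"id-{uuid.uuid4()}")  # "id-" prefix is our convention
        t.add_tag("log")
    return t


def sniff_udf(t: DebugTuple) -> bool:
    # Sniff: true if this tuple should be written to the log.
    return t.has_tag("log")


log: List[Tuple[List[str], Tuple[Any, ...]]] = []
for record in [("spark", 3), ("flink", None), ("wayang", 2)]:
    t = tag_udf(DebugTuple(record))
    if sniff_udf(t):
        # Keep the tags with the value so the id is available for tracing.
        log.append((list(t.tags), t.value))
```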
In the above code, the tag primitive generates a unique identifier for any tuple that contains a null value and adds it to the tuple's metadata along with a "log" tag. The sniff primitive checks if a tuple has the "log" tag and returns true if it does, indicating that the tuple should be logged.
TagSniff in a nutshell
It's worth noting that "TagSniff" was designed to be as simple as possible and is defined at the tuple granularity only. However, you might wonder how to use "TagSniff" on a set of tuples. One approach is to use the tag primitive to add a tag to a tuple that indicates its membership in a set, and then use the sniff primitive to check for the presence of that tag on a per-tuple basis. By providing only two primitives, tag and sniff, the model makes common debugging tasks easy to compose and custom debugging tasks possible. Its main advantage is its flexibility: it supports most online and post-hoc debugging tasks, and users can add custom debugging functionality via user-defined functions. We will illustrate these two data debugging modes in the next blog post.
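The set-membership approach described above can be sketched in a few lines of Python. The SUSPECT_KEYS set, the "set-suspects" tag name, and the (tags, tuple) representation are hypothetical and only illustrate the idea.

```python
from typing import Any, FrozenSet, List, Tuple

# Hypothetical debug tuple representation: (tags, original tuple).
DebugTuple = Tuple[FrozenSet[str], Tuple[Any, ...]]

# Hypothetical set of keys whose tuples we want to treat as one group.
SUSPECT_KEYS = {"user-7", "user-9"}


def tag_udf(t: DebugTuple) -> DebugTuple:
    # Tag: record set membership on each tuple individually.
    tags, value = t
    if value[0] in SUSPECT_KEYS:
        return (tags | {"set-suspects"}, value)
    return t


def sniff_udf(t: DebugTuple) -> bool:
    # Sniff: a per-tuple check is enough to recover the whole set.
    return "set-suspects" in t[0]


stream: List[DebugTuple] = [
    (frozenset(), ("user-1", 10)),
    (frozenset(), ("user-7", 42)),
    (frozenset(), ("user-9", 5)),
]
members = [t[1] for t in map(tag_udf, stream) if sniff_udf(t)]
```

Both UDFs still look at one tuple at a time; the set only emerges when the sniffed tuples are collected.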
Overall, the TagSniff model provides a powerful abstraction for data debugging that can be used in a variety of contexts and can significantly reduce the amount of boilerplate code required for debugging tasks.
The cool part? Blossom Sky works hand-in-hand with top data frameworks like Databricks, Snowflake, Cloudera, and others, including Hadoop, Teradata, and Oracle. Plus, it's fully compatible with AI favorites like TensorFlow, Pandas, and PyTorch. We've made sure it fits right into your existing setup.
Blossom Sky is the commercial version of Apache Wayang, and we're proud to offer it as Open Source Software. You can check out our public GitHub repo right here. If you're enjoying our software, we'd love your support - a star ⭐ would mean a lot to us!
If you need professional support from our team of industry-leading experts, you can always reach out to us via Slack or email.