Blossom Sky: The Most Asked Questions Answered

November 9, 2023
Dr. Zoi Kaoudi

Summary: In this blog post, we answer key questions about the data processing capabilities of Blossom Sky, GDPR compliance, data access delegation and performance optimization. Learn how Blossom Sky can simplify complex data scenarios and empower your organization with effective data management and AI deployment. Explore the challenges of data regulation in today’s AI-driven world, and how Blossom Sky solves the problems of dispersed data silos and unifies different, incompatible data technologies.

How Does Blossom Sky Solve Data Regulation Challenges?

We are often asked how we tackle the growing challenge of data regulation, especially now that AI keeps gaining importance and data is scattered across silos and technologies, which makes data management increasingly complex. A typical question looks like this:

We have US customer data on a Spark cloud in NYC, and EU customer data on a SQL Data Warehouse on premise in Paris. The requirement is to find late fees charged by customer account size and country. How does Blossom process data to generate a summary table while meeting GDPR compliance?

Blossom Sky inherently offers data compliance (GDPR, HIPAA, CCPA, etc.) via its data federation technology. With Blossom Sky, the data engineer writes a single Wayang job composed of the selection, projection, and aggregation statements required for the query. This job simply declares the paths to the two sources in a configuration file. Blossom Sky then creates two independent Wayang jobs and ships them to the two sites. Each platform, i.e., the Spark cluster in NYC and the SQL data warehouse in Paris, executes its job locally.
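To make the idea concrete, here is a minimal sketch in plain Python of what each site computes locally. It is not the Wayang or Blossom Sky API; the function name, the field names (`fee_type`, `account_size`, `amount`) and the sample rows are all assumptions made up for illustration.

```python
from collections import defaultdict


def local_late_fee_summary(rows):
    """Hypothetical per-site job: selection, projection, aggregation."""
    totals = defaultdict(float)
    for row in rows:
        # Selection: keep only late-fee records.
        if row["fee_type"] != "late":
            continue
        # Projection: only country and account size are needed as the key.
        key = (row["country"], row["account_size"])
        # Aggregation: sum the fee amounts per key.
        totals[key] += row["amount"]
    return dict(totals)


# Invented sample data standing in for the two sources.
nyc_rows = [
    {"country": "US", "account_size": "large", "fee_type": "late", "amount": 40.0},
    {"country": "US", "account_size": "large", "fee_type": "late", "amount": 10.0},
    {"country": "US", "account_size": "small", "fee_type": "annual", "amount": 99.0},
]
paris_rows = [
    {"country": "FR", "account_size": "small", "fee_type": "late", "amount": 15.0},
    {"country": "DE", "account_size": "large", "fee_type": "late", "amount": 25.0},
]

# Each site runs the same logic on its own data; only these small
# aggregated dictionaries would ever need to leave the site.
nyc_partial = local_late_fee_summary(nyc_rows)
paris_partial = local_late_fee_summary(paris_rows)
```

The point of the sketch is that the raw rows never move: what leaves each site is a handful of totals keyed by country and account size.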

There are three ways to merge the results and generate the summary table, and all of them comply with data regulations. Use the method that best fits your use case.

Method 1 - Remote Data Federation

When Blossom Sky executes federated data operations on multiple data sources at the same time, the intermediate results of those operations can be sent to the location where Blossom Sky is running, for example the central data team in the US or France. There, Blossom Sky combines these intermediate results into a summary table using Apache Wayang's data processing capabilities. In essence, Blossom Sky processes the data at its sources, brings the computed intermediate results to a central location, and merges them into a consolidated summary table.

The intermediate aggregated results are shipped to the location where Blossom Sky is running
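The merge step of Method 1 can be sketched as follows, again in plain Python rather than the actual Wayang API. The function `merge_partials` and the example totals are hypothetical; the sketch only assumes that each site has already produced a small dictionary of aggregated late fees keyed by (country, account size).

```python
def merge_partials(*partials):
    """Combine per-site aggregates into one summary (hypothetical helper)."""
    summary = {}
    for partial in partials:
        for key, total in partial.items():
            # Keys that appear at more than one site are summed.
            summary[key] = summary.get(key, 0.0) + total
    return summary


# Invented intermediate results as they might arrive from the two sites.
nyc_partial = {("US", "large"): 50.0, ("US", "small"): 12.5}
paris_partial = {("FR", "small"): 15.0, ("DE", "large"): 25.0}

summary = merge_partials(nyc_partial, paris_partial)
for (country, size), total in sorted(summary.items()):
    print(f"{country:<3} {size:<6} {total:>8.2f}")
```

Because summation is associative, the same helper works no matter where the merge runs, which is what makes the location of the merge a free choice.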

Method 2 - Local Data Federation

Blossom Sky also supports location-based data federation. The intermediate aggregated results generated in New York City (NYC) are transmitted to the other location, Paris, where they are merged with the data present there into a consolidated dataset. The merged data is then processed further in Paris, fully GDPR compliant.

Decentralized data merge with Blossom Sky

Alternatively, the intermediate aggregated results computed in Paris are sent to New York City (NYC), where they are combined into a single summary table. This approach consolidates the data without requiring raw data to leave its original location, ensuring data privacy and compliance. The intermediate results are then processed further in NYC, for example to build a holistic view of market conditions in certain economic areas.

Decentralized data federation with Blossom Sky

In all scenarios, Blossom Sky prioritizes data security and privacy by ensuring that raw data remains within its original location, adhering to strict compliance regulations. Unlike other solutions, Blossom Sky does not require the deployment of third-party execution engines on the data pools, simplifying the data management process. With Blossom Sky's advanced capabilities, organizations have the flexibility to choose the approach that suits their specific needs. Whether they opt for data aggregation at the platform where Blossom Sky operates, at the source platform, or even let Blossom's AI optimizer make the decision, Blossom Sky empowers users to effortlessly navigate complex data scenarios while maintaining the highest data security standards.

How Does Blossom Sky Ensure Data Access Controls?

We have a scenario where we need to ensure that only specific team members have access to sensitive financial data while others should be restricted. How does Blossom Sky handle such stringent data access control requirements?

One of the most frequently asked questions is how Blossom Sky ensures stringent data access controls. Our platform operates on the principle of tight access delegation, where each user is granted access to specific data tables, mirroring the level of control available within your organization. We require user authentication only for our studio, Blossom Studio, enabling users to create working groups and manage access efficiently. This approach ensures that data access remains secure and controlled, minimizing the risk of unauthorized access and breaches while maintaining an intuitive and streamlined user experience. In practice, this means that the user executing a federated job must have access to the data sources included in their query, a process typically managed internally within organizations.
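The rule that a user must have access to every data source in their federated job can be illustrated with a small check. This is only a sketch in plain Python: the functions `missing_grants` and `can_run`, the user names and the table names are all hypothetical and do not reflect how Blossom Sky or Blossom Studio implement access delegation.

```python
def missing_grants(user_grants, job_sources):
    """Return the job's data sources the user has no access to."""
    return sorted(set(job_sources) - set(user_grants))


def can_run(user_grants, job_sources):
    """A job is allowed only if every source it touches is granted."""
    return not missing_grants(user_grants, job_sources)


# Invented grants, as they might be managed inside an organization.
grants = {
    "finance_analyst": {"nyc.late_fees", "paris.late_fees"},
    "marketing_analyst": {"nyc.campaigns"},
}
job_sources = ["nyc.late_fees", "paris.late_fees"]

for user, tables in grants.items():
    if can_run(tables, job_sources):
        print(f"{user}: allowed")
    else:
        print(f"{user}: denied, missing {missing_grants(tables, job_sources)}")
```

In this sketch the finance analyst may run the job, while the marketing analyst is denied because both late-fee tables are missing from their grants.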

Do we have a holistic view of all datasets using Blossom or do we need a Master Data Management (MDM) type layer?

Blossom Sky offers the capability to connect with multiple data pools and platforms, making it unnecessary to implement a separate Master Data Management (MDM) layer only for Blossom Sky. Our platform seamlessly integrates with existing data management systems, serving as a versatile and compliant solution for executing data pipelines, streamlining data operations, and ensuring data consistency across the organization.

How Does Blossom Sky's AI Optimizer Enhance Data Processing Efficiency?

We've been facing performance and reliability issues with our Spark and SQL instances. How does Blossom Sky's AI optimizer address these challenges, and can you share a specific case where it improved data processing efficiency for an organization with similar issues?

In scenarios where users may inadvertently make suboptimal decisions, Blossom Sky addresses performance concerns with its AI optimizer. For instance, in the example above, when the merge operation involves a large volume of intermediate results, running it in a third location and using the Java platform as a plugin could result in very long processing times or even memory exceptions. Blossom Sky's AI optimizer decides where each operation should take place, taking into account the runtime, the monetary cost, or both. Below we see an example of a classification task where the optimizer of Blossom decided on a plan that outperforms the single Java and Spark plugins by more than an order of magnitude.

Blossom Sky's AI Optimizer Enhances Data Processing Efficiency
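To give a feel for the kind of decision involved, the toy sketch below picks a merge location by estimated transfer cost alone. It is deliberately simplistic and is not Blossom Sky's optimizer: the function `choose_merge_site`, the site names, the sizes and the flat cost per megabyte are all assumptions for illustration.

```python
def choose_merge_site(partial_sizes_mb, candidate_sites, cost_per_mb=1.0):
    """Pick the site that minimizes the data shipped to it (toy cost model)."""

    def transfer_cost(site):
        # Results already located at the candidate site do not need to move.
        moved = sum(
            size for origin, size in partial_sizes_mb.items() if origin != site
        )
        return moved * cost_per_mb

    return min(candidate_sites, key=transfer_cost)


# Invented sizes of the intermediate results produced at each site.
sizes = {"nyc": 800.0, "paris": 20.0}

# "central" stands for a third location that holds no data of its own.
best = choose_merge_site(sizes, ["nyc", "paris", "central"])
print(best)  # prints "nyc": only the 20 MB from Paris has to be shipped
```

With these numbers, merging at the third location would move 820 MB and merging in Paris 800 MB, so the sketch picks NYC. A real optimizer would weigh many more factors, such as platform runtime and monetary cost.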

How Much Effort is Needed To Start With Blossom Sky?

What programming options are available with Blossom Sky, and how user-friendly is the platform for our data team, especially if they have experience with tools like Apache Spark?

Blossom supports standard SQL, which makes it convenient to write analytical pipelines. Additionally, it comes with three programmatic APIs: a Scala-like Java API, a Scala API, and an SQL API. A Python API is on its way too! Writing pipelines from scratch involves a small learning curve, though hardly any for those familiar with big data platforms such as Apache Spark. Blossom Sky also comes with Blossom Studio, where users can drag and drop operators to build their pipelines with low-code effort.

Blossom Studio, the low code Federated Learning Platform

About DataBloom

Blossom Sky is all about taking data collaboration and efficiency to the next level. Our platform tackles the big challenge of data silos, bringing everything together in one easy-to-use system. It's built to work smoothly with a whole range of AI algorithms and models.

The cool part? Blossom Sky works hand-in-hand with top data frameworks like Databricks, Snowflake, Cloudera, and others, including Hadoop, Teradata, and Oracle. Plus, it's fully compatible with AI favorites like TensorFlow, Pandas, and PyTorch. We've made sure it fits right into your existing setup.

Blossom Sky is the commercial version of Apache Wayang, and we're proud to offer it as Open Source Software. You can check out our public GitHub repo right here. If you're enjoying our software, we'd love your support - a star ⭐ would mean a lot to us!

If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.