Browse our library; all papers are in chronological order, newest first. Bookmark this page to stay up to date on our latest articles and research. Another excellent resource is our blog, where our experts discuss the most recent advancements in AI, ML, and data processing.
P2A: Framework for Optimizing Data Science Pipelines
Our approach identifies DBMS-supported operations and translates them into SQL, leveraging DBMSes to accelerate data science workloads. The optimization target is twofold: first, to improve data loading by reducing the amount of data transferred between runtimes.
Artificial intelligence to advance Earth observation
Earth observation (EO) is a prime instrument for monitoring land and ocean processes, studying the dynamics at work, and taking the pulse of our planet. This article gives a bird's eye view of the essential scientific tools and approaches informing and supporting the transition from raw EO data to usable EO-based information.
Navigating Compliance in Federated Data Processing
The processing of geo-distributed data is subject to data transfer regulations. In this paper, we present our work on a federated data processing system that can comply with these regulations. We also present research challenges and opportunities for making compliance a true first-class citizen in the system.
ML-based Cross-Platform Query Optimization
Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time tuning the associated cost models. This problem is only exacerbated in cross-platform settings, as there are many more parameters that need to be tuned.
RHEEMix in the Data Jungle
Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements.
Optimizing Cross-Platform Data Movements
Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner.
Simplified Big Data Debugging for Dataflow Jobs
Although big data processing has become dramatically easier over the last decade, there has not been matching progress in big data debugging. It is estimated that users spend more than 50% of their time debugging their big data applications.
Enabling Cross-Platform Data Processing
Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms.
Cross-Platform Data Analytics Made Easy
Many of today’s applications need several data processing platforms for complex analytics. Thus, recent systems have taken steps towards supporting cross-platform data analytics. Yet, current cross-platform systems lack ease of use, which is crucial for their adoption.
Challenges of Cross-Platform Data Processing
There is a zoo of data processing platforms which help users and organizations to extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms.
Building your Cross-Platform Application with RHEEM
Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging because it requires considerable expertise in all the available data processing platforms.
Fast and scalable inequality joins
Inequality joins, which join relations on inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as B+-tree, R∗-tree and Bitmap.
A Cost-based Optimizer for Gradient Descent Optimization
As the use of machine learning (ML) permeates diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it.
Road to Freedom in Big Data Analytics
The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges, chiefly platform dependence, poor interoperability, and poor performance when using multiple platforms.
Enabling Multi-Platform Task Execution
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications.
BigDansing: A System for Big Data Cleansing
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions.
Lightning Fast and Space Efficient Inequality Joins
Inequality joins, which join relational tables on inequality conditions, are used in various applications. There has been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join to various indices such as B+-tree, R∗-tree and Bitmap.