RumbleML: program the lakehouse with JSONiq
Ghislain Fourny, David Dao, Can Berker Cikis, Ce Zhang, and Gustavo Alonso

TL;DR
RumbleML introduces a declarative, JSONiq-based library integrated into RumbleDB, enabling seamless data processing and machine learning tasks in lakehouse systems without performance loss.
Contribution
It presents the first prototype of a JSONiq-based system for unified data and machine learning workflows in lakehouses, demonstrating significant productivity and functionality improvements.
Findings
Comparable performance to Spark
Enhanced data cleaning and normalization capabilities
Unified language for data and ML tasks
Abstract
Lakehouse systems have reached unprecedented size and heterogeneity in the past few years and have been embraced by many industry players. However, they are often difficult to use because they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does so using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in…
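The first design insight in the abstract, that training, evaluation, and test sets can be represented as homogeneous sequences of flat objects, can be illustrated with a small sketch. This is written in plain Python as a stand-in and is not RumbleML or JSONiq code; the helper names (`is_flat`, `is_homogeneous`, `to_feature_matrix`) and the sample records are hypothetical and chosen only for illustration.

```python
from typing import Any

# Hypothetical helpers for illustration only; not part of RumbleML or RumbleDB.

def is_flat(obj: dict[str, Any]) -> bool:
    """Return True if no value is a nested object (dict) or array (list)."""
    return all(not isinstance(v, (dict, list)) for v in obj.values())


def is_homogeneous(rows: list[dict[str, Any]]) -> bool:
    """Return True if every object has the same keys and the same value type per key."""
    if not rows:
        return True
    schema = {k: type(v) for k, v in rows[0].items()}
    return all({k: type(v) for k, v in r.items()} == schema for r in rows)


def to_feature_matrix(
    rows: list[dict[str, Any]], feature_keys: list[str], label_key: str
) -> tuple[list[list[float]], list[Any]]:
    """Turn a homogeneous sequence of flat objects into a feature matrix and labels."""
    if not (all(is_flat(r) for r in rows) and is_homogeneous(rows)):
        raise ValueError("expected a homogeneous sequence of flat objects")
    X = [[float(r[k]) for k in feature_keys] for r in rows]
    y = [r[label_key] for r in rows]
    return X, y


# Made-up sample records: each object is flat and all share one schema.
training_set = [
    {"age": 34, "income": 52000.0, "label": 1},
    {"age": 51, "income": 38000.0, "label": 0},
]

X, y = to_feature_matrix(training_set, ["age", "income"], "label")
```

Because every object shares the same keys and scalar value types, the sequence maps directly onto the rows and columns a learning algorithm expects, which is presumably what makes this representation convenient for a declarative engine.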
Taxonomy
Topics: Time Series Analysis and Forecasting · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications
