JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu

TL;DR
JoinBoost is a Python library that rewrites tree-model training over normalized databases into pure SQL, so models are trained inside the DBMS without denormalizing, materializing, or exporting the data to an external ML tool.
Contribution
It introduces a portable, SQL-only approach to tree training over normalized data, and extends prior work on both the algorithmic and the systems side to improve performance and scalability.
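To make "tree training in SQL over normalized data" concrete, here is a minimal sketch of how one regression-tree split could be chosen with a single query that never materializes the join. It is not JoinBoost's API or the SQL it generates: the tables `stores` and `sales`, the columns `size` and `y`, and the function `best_split` are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for the DBMS. The query first aggregates the fact table by join key, then joins that small aggregate with the dimension table, and finally scores each threshold by S_L²/C_L + S_R²/C_R, where S and C are the sum and count of the target on each side. Maximizing this score is equivalent to minimizing the squared error of the split.

```python
import sqlite3

# Hypothetical normalized schema: a dimension table and a fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, size INTEGER);
    CREATE TABLE sales  (store_id INTEGER, y REAL);
    INSERT INTO stores VALUES (1, 10), (2, 20), (3, 30), (4, 40);
    INSERT INTO sales  VALUES (1, 1.0), (1, 2.0), (2, 1.5),
                              (3, 8.0), (3, 9.0), (4, 10.0);
""")

BEST_SPLIT_SQL = """
WITH per_key AS (      -- 1. aggregate the fact table by join key
    SELECT store_id, COUNT(*) AS c, SUM(y) AS s
    FROM sales
    GROUP BY store_id
),
per_value AS (         -- 2. join the small aggregate, group by feature value
    SELECT st.size AS v, SUM(pk.c) AS c, SUM(pk.s) AS s
    FROM per_key pk JOIN stores st USING (store_id)
    GROUP BY st.size
),
total AS (
    SELECT SUM(c) AS c, SUM(s) AS s FROM per_value
),
prefix AS (            -- 3. count and sum of y on the left side "size <= v"
    SELECT a.v AS v, SUM(b.c) AS cl, SUM(b.s) AS sl
    FROM per_value a JOIN per_value b ON b.v <= a.v
    GROUP BY a.v
)
SELECT p.v,
       p.sl * p.sl / p.cl
       + (t.s - p.sl) * (t.s - p.sl) / (t.c - p.cl) AS score
FROM prefix p, total t
WHERE p.cl < t.c       -- skip the threshold that leaves the right side empty
ORDER BY score DESC, p.v
LIMIT 1
"""

def best_split(connection):
    """Return (threshold, score) for the best split 'size <= threshold'."""
    return connection.execute(BEST_SPLIT_SQL).fetchone()

print(best_split(conn))
```

On this toy data the query selects the split `size <= 20`, which separates the low-valued sales of stores 1 and 2 from the high-valued sales of stores 3 and 4. The prefix sums use a self-join only to keep the sketch runnable on older SQLite builds; a window function would be the more usual choice where available.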
Findings
JoinBoost achieves 3x faster training than LightGBM for random forests.
It outperforms state-of-the-art In-DB ML systems by over an order of magnitude.
The system scales well with increasing data size, features, and schema complexity.
Abstract
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized…
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Machine Learning and Data Classification
Methods: Lib
