Relatron: Automating Relational Machine Learning over Relational Databases

Zhikai Chen; Han Xie; Jian Zhang; Jiliang Tang; Xiang Song; Huzefa Rangwala

arXiv:2602.22552·cs.LG·February 27, 2026

Open Access · 3 Reviews

TL;DR

Relatron introduces a task-aware meta-selector that intelligently chooses between relational deep learning and classical feature synthesis, significantly improving predictive performance over diverse relational database tasks with reduced tuning costs.

Contribution

The paper unifies RDL and DFS in a shared design space, develops a performance bank, and proposes Relatron, a meta-selector leveraging task signals for improved architecture selection.

Findings

01

RDL does not always outperform DFS; performance depends on the task.

02

No single architecture is best across all tasks, highlighting the need for task-aware selection.

03

Validation accuracy alone is unreliable for choosing architectures.

Abstract

Predictive modeling over relational databases (RDBs) powers applications, yet remains challenging due to capturing both cross-table dependencies and complex feature interactions. Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite performance gains, the comparative advantages of RDL over DFS and the design principles for selecting effective architectures remain poorly understood. We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for…
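
To make the contrast in the abstract concrete, here is a minimal sketch of the DFS side: fixed, non-parametric aggregators applied to child rows grouped by a parent key. It is an illustration only, not the paper's implementation; the `customers` and `orders` tables, the `synthesize_features` helper, and the choice of aggregators are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: a parent table and a child table linked by customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]

# Predefined, non-parametric aggregators (nothing here is learned).
AGGREGATORS = {"count": len, "mean": mean, "max": max, "sum": sum}


def synthesize_features(parents, children, key, column, aggregators=AGGREGATORS):
    """Aggregate one child column per parent row (a single DFS-style step)."""
    groups = defaultdict(list)
    for row in children:
        groups[row[key]].append(row[column])

    features = []
    for parent in parents:
        values = groups.get(parent[key], [])
        feats = {key: parent[key]}
        for name, fn in aggregators.items():
            if values:
                feats[f"{name}({column})"] = fn(values)
            else:
                # Parents with no child rows get a zero count and missing values otherwise.
                feats[f"{name}({column})"] = 0 if name == "count" else None
        features.append(feats)
    return features


features = synthesize_features(customers, orders, "customer_id", "amount")
for row in features:
    print(row)
```

The resulting flat feature table could then be passed to any tabular model. An RDL model would instead learn its aggregation through message passing over the same foreign-key links.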

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The architecture search (180 configs/task for entity-level, 20 for DFS) is comprehensive. The findings are interesting: they show that DFS wins on more tasks and attribute the gaps to homophily.

Weaknesses

1. The scope of the paper is limited, as it focuses on from-scratch models and defers foundation models (e.g., Griffin, KumoRFM) despite comparing against them (Fig. 2). It would be interesting to see how Relatron can handle pretrained relational foundation models.
2. The experimental datasets are limited, covering only most of RelBench and two additional tasks. It would be interesting to see the performance across multiple datasets.
3. The correlation between homophily and RDL-DFS performance gains are stron

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Extensive experiments over various baselines and datasets, reported in Figure 2 and Tables 2-4.
2. Insightful observation for model selection. To me, the "correlation between homophily and RDL-DFS performance gap" is the most interesting finding. Although homophily is a common tool in graph learning, to the best of our knowledge this is the first time it has been applied to RDB datasets.

Weaknesses

1. Griffin, which is cited in this work, should also be included as an important baseline on RDB tasks.
2. A comparison between these AutoML frameworks and a vanilla baseline is missing. If the framework incurs a large computational overhead, the vanilla baseline may be preferred in real-world applications.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

* The paper is thorough in its parameterization of the design space.
* The findings are interesting. I like the experiment and the finding that validation metrics are unreliable, as this is very important for temporal splits such as those in leading relational benchmarks.
* The metrics proposed for "task embeddings" are simple, clear, clever, and have high predictive power.
* The baseline comparisons are comprehensive.
* The paper is well-written overall.

Weaknesses

I think it is important to include intuitions and analysis (theoretical as well as empirical) for why RDL is better at low-homophily tasks and DFS is better at high-homophily tasks in the main paper. Currently it is in Appendix C.2, but it is too long (4 pages!) and it is not clear what the intuitive takeaways are. It would be nice to have a discussion of the main intuitions and takeaways in the main paper. It would be ideal if the theory can be substantiated with some experiments.
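
All three reviews refer to homophily as the signal that separates RDL-favoring tasks from DFS-favoring ones, but this page does not give the paper's definition. As a rough illustration only, the sketch below computes the standard edge homophily ratio from graph learning (the fraction of edges whose endpoints share a label), used here as a stand-in. The `edge_homophily` helper and the toy graph are assumptions for this example, and the paper's RDB-specific measure may differ.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label.

    edges:  iterable of (u, v) pairs
    labels: dict mapping a node to its label; edges touching unlabeled nodes are skipped
    """
    labeled = [(u, v) for u, v in edges if u in labels and v in labels]
    if not labeled:
        raise ValueError("no edge has labels on both endpoints")
    same = sum(1 for u, v in labeled if labels[u] == labels[v])
    return same / len(labeled)


# Hypothetical toy graph: two same-label edges and two cross-label edges.
labels = {"a": 1, "b": 1, "c": 0, "d": 0}
edges = [("a", "b"), ("c", "d"), ("a", "c"), ("b", "d")]

ratio = edge_homophily(edges, labels)
print(ratio)  # 0.5
```

A ratio near 1 means linked entities tend to share labels, and a ratio near 0 means they tend to differ. According to Reviewer 03's summary, the paper finds DFS stronger in the high-homophily regime and RDL stronger in the low-homophily regime.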

Taxonomy

Topics: Data Quality and Management · Advanced Graph Neural Networks · Machine Learning and Data Classification