RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models
Valter Hudovernik, Minkai Xu, Juntong Shi, Lovro Šubelj, Stefano Ermon, Erik Štrumbelj, Jure Leskovec

TL;DR
RelDiff is a new diffusion-based generative model that synthesizes complete relational databases by explicitly modeling their graph structure, improving realism and structural integrity over previous methods.
Contribution
It introduces a novel diffusion model that explicitly captures relational graph structures for generating synthetic databases, addressing limitations of prior flat-table approaches.
Findings
Outperforms prior methods on 11 benchmark datasets
Produces more realistic and coherent synthetic relational data
Maintains high fidelity and referential integrity
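Referential integrity, as named in the findings above, means that every foreign key value in a child table matches a primary key in its parent table. The following is a minimal sketch of such a check, not the paper's evaluation code; the function name `dangling_foreign_keys`, the toy `customers`/`orders` tables, and their column names are all hypothetical.

```python
def dangling_foreign_keys(parent_rows, child_rows, pk, fk):
    """Return the child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


# Hypothetical toy tables: orders reference customers via customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # references a customer that does not exist
]

violations = dangling_foreign_keys(customers, orders, "customer_id", "customer_id")
```

A synthetic database preserves referential integrity for this relationship when the returned list is empty.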
Abstract
Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and an SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and…
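The split described in the abstract, structure generation first and attribute synthesis conditioned on that structure second, can be illustrated with a toy two-stage sampler. This is only a sketch of the general idea under simplifying assumptions: it uses neither a Stochastic Block Model nor a diffusion process, and the functions `sample_structure` and `sample_attributes`, the resampled child counts, and the Gaussian dependencies are all hypothetical stand-ins.

```python
import random


def sample_structure(n_parents, child_counts, rng):
    """Stage 1 (structure): create foreign-key links as (child_id, parent_id) pairs.

    child_counts is a hypothetical list of observed children-per-parent values;
    one count is resampled from it for each synthetic parent.
    """
    edges = []
    child_id = 0
    for parent_id in range(n_parents):
        for _ in range(rng.choice(child_counts)):
            edges.append((child_id, parent_id))
            child_id += 1
    return edges


def sample_attributes(n_parents, edges, rng):
    """Stage 2 (attributes): draw values conditioned on the fixed structure."""
    degree = [0] * n_parents
    for _, parent_id in edges:
        degree[parent_id] += 1
    # Toy dependency: a parent's attribute depends on its number of children.
    parent_attr = [rng.gauss(10.0 * degree[p], 1.0) for p in range(n_parents)]
    # Toy dependency: a child's attribute is centred on its parent's attribute.
    child_attr = {c: rng.gauss(parent_attr[p], 1.0) for c, p in edges}
    return parent_attr, child_attr


rng = random.Random(0)  # fixed seed so the sketch is reproducible
edges = sample_structure(n_parents=5, child_counts=[0, 1, 1, 2, 3], rng=rng)
parent_attr, child_attr = sample_attributes(5, edges, rng)
```

Because every child row is created together with its link to an existing parent, referential integrity holds by construction in this toy setting, regardless of what the attribute stage produces.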
Peer Reviews
Decision·Submitted to ICLR 2026
1. The modeling choices are well-motivated: microcanonical SBM gives hard constraints for referential integrity; hybrid diffusion aligns with mixed continuous/categorical columns; heterogeneous GNNs with subgraph sampling are a sensible scalability strategy.
2. Joint graph-conditioned diffusion over the entire entity graph, coupled with a microcanonical, nested SBM to preserve relational cardinalities and hierarchy, is a clean and compelling synthesis.
1. The baselines omit recent joint modeling approaches like GRDM (Graph-Conditional Relational Diffusion Model), which also performs joint denoising over relational graphs and reports strong k-hop performance. The paper positions prior work mainly as sequential/conditional (ClavaDDPM, etc.), but the landscape now includes joint graph-conditioned diffusion and flow-matching variants.
2. The nested SBM is well-motivated for modular schemas, but the paper preprocesses two-parent/no-child tables to
**Quality**. The paper uses a combination of different techniques. First, it generates a graph via its D2K + SBM generator, which combines a Bayesian SBM as a model of graphs with a D2K graph generator to preserve relationships between nodes. Subsequently, the authors define a conditional hybrid diffusion process that generates categorical and numerical samples conditioned on the generated graph.
**Clarity**. The paper is easy to follow.
**Significance**. Paper looks at tabular data generati
Overall, the experiments and ablation studies are comprehensive, covering performance, runtime, computation, and privacy. However, I have a concern about novelty. The method is a combination of existing well-known methods, which I believe may be insufficient by the current standards of conferences like NeurIPS, ICLR, and ICML. The main takeaway is that integrating graph-based generators into diffusion models helps provide extra signal to improve generative performance.
1. The paper is generally well-written and easy to follow.
2. The use of the D2K + SBM graph generator to preserve foreign key cardinality and hierarchical dependencies is novel and technically interesting.
3. The ablation study is comprehensive.
1. The decomposition $p(\mathcal{V},\mathcal{E}) = p(\mathcal{E})\,p(\mathcal{V}\mid\mathcal{E})$ is assumed without theoretical support.
2. The proposed joint diffusion model is not clearly novel compared with existing tabular diffusion approaches such as TabDDPM, TABSYN, and TabDiff.
3. The high training cost of RelDiff raises scalability concerns, and memory usage is not reported.
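For reference, the factorization named in point 1 has the same form as the general product rule of probability, written out below. Reading $\mathcal{V}$ as the node attributes and $\mathcal{E}$ as the foreign-key edges is an assumption based on the abstract, not something the review excerpt states.

$$
p(\mathcal{V}, \mathcal{E}) = p(\mathcal{E})\, p(\mathcal{V} \mid \mathcal{E}),
\qquad
p(\mathcal{V} \mid \mathcal{E}) = \frac{p(\mathcal{V}, \mathcal{E})}{p(\mathcal{E})}
\quad \text{for } p(\mathcal{E}) > 0.
$$

The product rule by itself makes no independence assumption; it does not say how either factor should be modeled or estimated.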
Taxonomy
Topics: Complex Network Analysis Techniques · Bioinformatics and Genomic Networks · Graph Theory and Algorithms
