Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data
Thomas Rückstieß, Robin Vujanic

TL;DR
ORiGAMi is an autoregressive transformer that synthesizes semi-structured JSON data directly, outperforming flattened approaches in fidelity, utility, and privacy across diverse datasets.
Contribution
It introduces a novel architecture that models JSON records natively without flattening, maintaining structure and achieving state-of-the-art results.
Findings
ORiGAMi outperforms baselines in 17 of 18 benchmarks.
Achieves privacy scores above 96% across datasets.
Maintains high fidelity, detection, and utility metrics.
Abstract
Synthetic data generation is an important capability for privacy-preserving data sharing, system benchmarking and test data provisioning. For mixed-type data, existing synthesizers largely target dense, fixed-schema tables, but many modern data systems store and exchange sparse, semi-structured JSON with nested objects, variable-length arrays and optional keys. Applying tabular synthesizers to such data requires flattening records into wide, sparse tables, turning nested structure and arrays into column-layout artifacts. We present ORiGAMi, an autoregressive transformer architecture for modeling and synthesizing semi-structured records without flattening. ORiGAMi serializes JSON records into key, value, and structural tokens, and encodes token positions by their path in the document tree. Grammar and schema constraints enforce syntactically valid JSON and dataset-consistent structure.…
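To make the serialization idea in the abstract concrete, here is a minimal sketch of walking a JSON record depth-first and emitting key, value, and structural tokens, each tagged with its path in the document tree. The token kinds (`KEY`, `VALUE`, `STRUCT`), the marker names (`OBJ_START`, `ARR_START`, and so on), and the `serialize` function are hypothetical names chosen for illustration; the abstract does not specify ORiGAMi's actual vocabulary or position encoding, so this should not be read as the authors' implementation.

```python
from typing import Any, List, Tuple

Path = Tuple[Any, ...]           # e.g. ("address", "city") or ("tags", 1)
Token = Tuple[str, Any]          # (kind, payload); kinds are illustrative only


def serialize(node: Any, path: Path = ()) -> List[Tuple[Token, Path]]:
    """Depth-first walk that emits (token, path) pairs for a JSON value."""
    out: List[Tuple[Token, Path]] = []
    if isinstance(node, dict):
        out.append((("STRUCT", "OBJ_START"), path))
        for key, value in node.items():
            child = path + (key,)
            out.append((("KEY", key), child))      # key token sits at its own path
            out.extend(serialize(value, child))    # value shares the key's path
        out.append((("STRUCT", "OBJ_END"), path))
    elif isinstance(node, list):
        out.append((("STRUCT", "ARR_START"), path))
        for index, item in enumerate(node):
            out.extend(serialize(item, path + (index,)))  # array index extends the path
        out.append((("STRUCT", "ARR_END"), path))
    else:
        out.append((("VALUE", node), path))        # scalar leaf: string, number, bool, None
    return out


# A record with a variable-length array and a nested object.
record = {"name": "Ada", "tags": ["x", "y"], "address": {"city": "Oslo"}}
pairs = serialize(record)

for token, token_path in pairs:
    print(token, token_path)
```

Because optional keys and array elements simply contribute more or fewer tokens, a record with a missing key yields a shorter sequence rather than an empty cell in a wide table, which is the contrast with flattening that the abstract draws.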
