LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models
Yihong Tang, Menglin Kong, Junlin He, Tong Nie, Lijun Sun

TL;DR
LLMSynthor leverages large language models to generate realistic micro-level data aligned with macro-statistics, enabling credible simulations in social sciences and urban studies without extensive data collection.
Contribution
This work introduces a macro-aware LLM-based framework that iteratively synthesizes micro-records matching target macro-statistics. It treats the LLM as a nonparametric copula to capture joint dependencies among variables and uses LLM Proposal Sampling for efficiency.
Findings
Achieves high realism and statistical fidelity in synthetic data across multiple domains.
Effectively captures joint dependencies among variables in generated micro-records.
Demonstrates broad applicability to economics, social science, and urban studies.
Abstract
Macro-aligned micro-records are crucial for credible simulations in social science and urban studies. For example, epidemic models are only reliable when individual-level mobility and contacts mirror real behavior, while aggregates match real-world statistics like case counts or travel flows. However, collecting such fine-grained data at scale is impractical, leaving researchers with only macro-level data. LLMSynthor addresses this by turning a pretrained LLM into a macro-aware simulator that generates realistic micro-records consistent with target macro-statistics. It iteratively builds synthetic datasets: in each step, the LLM generates batches of records to minimize discrepancies between synthetic and target aggregates. Treating the LLM as a nonparametric copula allows the model to capture realistic joint dependencies among variables. To improve efficiency, LLM Proposal Sampling…
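The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the target counts, the variable names, and the `propose_records` function are hypothetical, and the LLM call is replaced by a simple stand-in that samples each category in proportion to its remaining deficit. The stand-in matches marginals only and does not model the joint dependencies that the paper attributes to the LLM.

```python
import random
from collections import Counter

# Hypothetical target macro-statistics: marginal counts per variable.
TARGET = {
    "age": {"young": 30, "adult": 50, "senior": 20},
    "mode": {"car": 60, "transit": 25, "walk": 15},
}

def marginals(records):
    """Count how often each category appears, per variable."""
    return {var: Counter(r[var] for r in records) for var in TARGET}

def discrepancy(records):
    """Target count minus synthetic count for every (variable, category)."""
    current = marginals(records)
    return {
        var: {cat: n - current[var][cat] for cat, n in cats.items()}
        for var, cats in TARGET.items()
    }

def propose_records(gap, batch_size, rng):
    """Stand-in for the LLM call (assumption): sample by remaining deficit."""
    batch = []
    for _ in range(batch_size):
        record = {}
        for var, cats in gap.items():
            names = list(cats)
            weights = [max(cats[c], 0) for c in names]
            if sum(weights) == 0:  # nothing under-represented: fall back to uniform
                weights = [1] * len(names)
            record[var] = rng.choices(names, weights=weights)[0]
        batch.append(record)
    return batch

def synthesize(total=100, batch_size=5, seed=0):
    """Grow the synthetic dataset batch by batch, guided by the discrepancy."""
    rng = random.Random(seed)
    records = []
    while len(records) < total:
        gap = discrepancy(records)
        size = min(batch_size, total - len(records))
        records.extend(propose_records(gap, size, rng))
    return records

def total_abs_error(records):
    """Sum of absolute count gaps across all variables and categories."""
    gap = discrepancy(records)
    return sum(abs(v) for cats in gap.values() for v in cats.values())
```

In this toy setting, `batch_size=1` reproduces the target marginals exactly, because a category is only sampled while it is still under-represented. Larger batches reduce the number of proposal calls but can overshoot a category within a batch, since the discrepancy is only recomputed between batches.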
Peer Reviews
Decision · Submitted to ICLR 2026
- This paper addresses a practical challenge in data synthesis: generating individual-level records when only aggregate statistics are available.
- The experiments on three different applications demonstrate broad applicability, and the method outperforms the considered baselines.
- How do you prevent "over-correction", where repeated discrepancy-guided sampling causes loss of diversity?
- Method limitations. In real-world applications, macro-statistics might be noisy or incomplete. Can the method incorporate uncertainty over the target aggregates?
- Limited empirical evaluation. How sensitive is the framework to LLM quality?
- The paper is very well written with illustrative figures.
- Synthetic data generation with LLMs is an important and timely problem.
- Method design feels practical. The loop is simple and likely easy to implement.
- Some limitations of LLMs are clearly discussed.
- Evaluation alignment vs. leakage. In Section 4.2, it is not fully clear which macro-stats are used as inputs to LLMSynthor. Are they the same as evaluation metrics? If they coincide, the method could look stronger than baselines because it is explicitly optimizing those statistics, whereas baselines also model micro-level realism beyond the chosen stats.
- Baselines breadth & recency. The population-synthesis baselines (CP/HMM/NVI) appear to be outdated relative to modern diffusion/tabular-LL
Synthesizing datasets using LLMs is a good topic.
- The paper lacks a description of important details, such as performance metric computation and the proposal for the micro-records sampling step. In Table 2, it is unclear what the Jsd column means. For other specified metrics, it is unclear how they are computed and why those metrics make sense in this application.
- Figure 11 appears to suggest that LLM-based generation still scales poorly with dataset size, potentially far slower than purely generative baselines like GReaT, which trains onc
Taxonomy
Topics: Human Mobility and Location-Based Analysis · COVID-19 epidemiological studies · Complex Network Analysis Techniques
