CircuitSynth: Reliable Synthetic Data Generation
Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

TL;DR
CircuitSynth is a neuro-symbolic framework that generates reliable synthetic data by decoupling semantic reasoning from surface realization, targeting formal guarantees on validity and coverage in structured generation tasks.
Contribution
It distills the reasoning of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD) that structurally enforces hard logical constraints, and adds a convex optimization mechanism to satisfy soft distributional goals.
Findings
Achieves 100% schema validity in complex logic puzzles.
Outperforms state-of-the-art methods in rare-combination coverage.
Significantly reduces logical inconsistencies in synthetic data.
Abstract
The generation of high-fidelity synthetic data is a cornerstone of modern machine learning, yet Large Language Models (LLMs) frequently suffer from hallucinations, logical inconsistencies, and mode collapse when tasked with structured generation. Existing approaches, such as prompting or retrieval-augmented generation, lack the mechanisms to balance linguistic expressivity with formal guarantees regarding validity and coverage. To address this, we propose CircuitSynth, a novel neuro-symbolic framework that decouples semantic reasoning from surface realization. By distilling the reasoning capabilities of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD), CircuitSynth creates a tractable semantic prior that structurally enforces hard logical constraints. Furthermore, we introduce a convex optimization mechanism to rigorously satisfy soft distributional goals. Empirical…
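The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the schema, the constraint, and all function names are hypothetical, a brute-force enumeration stands in for the compiled PSDD, and the soft goal is a single marginal target solved with the closed-form KL projection (a convex problem) rather than whatever optimizer the authors use.

```python
from itertools import product

# Hypothetical toy schema: three boolean attributes of a synthetic record.
VARS = ("premium", "discount", "trial")


def is_valid(a):
    # Hypothetical hard constraint: a trial account cannot also be premium.
    return not (a["trial"] and a["premium"])


def valid_support():
    """Enumerate every assignment that satisfies the hard constraint.

    A PSDD would encode this support compactly as a circuit; brute force
    is only workable here because the toy schema is tiny.
    """
    support = []
    for values in product([False, True], repeat=len(VARS)):
        a = dict(zip(VARS, values))
        if is_valid(a):
            support.append(a)
    return support


def uniform_prior(support):
    """Assumed prior: uniform over valid assignments (invalid ones get zero mass)."""
    return [1.0 / len(support)] * len(support)


def project_marginal(support, prior, var, target):
    """Return the distribution closest in KL divergence to `prior`
    among those with P(var=True) == target.

    For a single marginal constraint this convex problem has a closed form:
    rescale the mass inside and outside the event separately.
    """
    mass = sum(p for a, p in zip(support, prior) if a[var])
    if not 0.0 < mass < 1.0:
        raise ValueError("prior puts all or no mass on the event; cannot rescale")
    return [
        p * target / mass if a[var] else p * (1.0 - target) / (1.0 - mass)
        for a, p in zip(support, prior)
    ]


if __name__ == "__main__":
    support = valid_support()
    prior = uniform_prior(support)
    # Soft distributional goal (assumed): half of all samples should be trial accounts.
    q = project_marginal(support, prior, "trial", 0.5)
    for a, p in zip(support, q):
        print(a, round(p, 4))
```

Because sampling only ever draws from the enumerated support, every sample satisfies the hard constraint by construction, while the reweighting shifts how often rare combinations appear without touching validity.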
