DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng

TL;DR
The paper introduces DESIGNER, a data synthesis pipeline that uses Design Logic to generate diverse, complex multidisciplinary reasoning questions at scale, significantly enhancing the reasoning capabilities of LLMs.
Contribution
It proposes a novel Design Logic framework and a two-stage retrieve-and-generate method to synthesize large-scale, diverse reasoning datasets across 75 disciplines, improving LLM reasoning performance.
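The two-stage retrieve-and-generate idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the tiny design-logic library, the bag-of-words similarity (a toy stand-in for whatever retrieval model the authors use), and the prompt wording are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector; a toy stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_design_logics(document, design_logics, top_k=1):
    """Stage 1 (retrieve): rank design logics by similarity to the source text."""
    doc_vec = bow(document)
    ranked = sorted(
        design_logics,
        key=lambda logic: cosine(doc_vec, bow(logic["description"])),
        reverse=True,
    )
    return ranked[:top_k]


def build_generation_prompt(document, design_logic):
    """Stage 2 (generate): assemble the prompt that would be sent to an LLM."""
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(design_logic["steps"], start=1)
    )
    return (
        "Follow the design logic below to write one exam question "
        "grounded in the source text.\n\n"
        f"Design logic ({design_logic['name']}):\n{steps}\n\n"
        f"Source text:\n{document}\n"
    )


# Hypothetical two-entry design-logic library, for illustration only.
DESIGN_LOGICS = [
    {
        "name": "physics-multi-law",
        "description": "Combine two physical laws about force and energy "
        "so the answer requires a multi-step calculation",
        "steps": ["Pick two related laws.", "Chain them in one scenario."],
    },
    {
        "name": "history-causal",
        "description": "Compare two historical events and ask which cause "
        "best explains a stated outcome",
        "steps": ["Pick two events.", "Ask for the decisive cause."],
    },
]

doc = (
    "A block slides down a frictionless incline; its energy converts from "
    "potential to kinetic as the force of gravity acts."
)
best = retrieve_design_logics(doc, DESIGN_LOGICS, top_k=1)[0]
print(build_generation_prompt(doc, best))
```

The sketch stops at prompt construction; in a real pipeline the prompt would be passed to an LLM, and the retrieval step would likely use dense embeddings rather than word overlap.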
Findings
Synthesized datasets contain more difficult and diverse questions.
Supervised fine-tuning on synthesized data improves LLM reasoning.
Fine-tuned models outperform the baselines and even the official final models.
Abstract
Large language models (LLMs) perform strongly on many language tasks but still struggle with complex multi-step reasoning across disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents to generate multidisciplinary questions. The central insight is the notion of Design Logic, a form of reusable meta-knowledge that encapsulates the structured process human experts use to transform knowledge into complex exam questions, enabling LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts with explicit control over difficulty, diversity, and question types. We use LLMs to reverse-engineer and abstract…
Peer Reviews
Decision: ICLR 2026 Poster
- The design logic direction seems like a novel and scalable method for condensing multi-domain learnings in a reusable manner.
- The resulting datasets yield solid improvements over other synthetic pre-training data methods.
- The paper is well written and easy to follow.
My biggest concerns are the use of a proprietary question bank and the impact of the design logics relative to simply having relevant, high-quality data:
- For the design logic process, I like the idea of having a static design logics bank; however, the design logics themselves don't seem to have a massive impact on performance. The 'w/o Design Logic' ablations show fairly small gains from using design logics (as opposed to just providing examples from the question
1. The proposed **DESIGNER** data synthesis pipeline is novel and well-motivated. It builds on the structured process human educators use to construct complex and insightful questions.
2. Using this pipeline, the authors created two new, large-scale reasoning datasets: **DLR-Book** (3.04 million questions) and **DLR-Web** (1.66 million questions), which can help the community improve existing models.
3. Detailed analyses prove with quantitative metrics that the DLR datasets are more difficult and more sema
1. The pipeline's success relies on massive proprietary assets. Though the authors state they will release a subset of the final synthesized data, this still limits the community's ability to fully reproduce the pipeline or build upon the design logic library.
2. The pipeline itself depends on existing very large, capable models (e.g., Qwen3-30B, DeepSeek-R1-0528), and this means massive computational resources are required to apply this pipeline.
3. The pipeline only creates large-scale dataset
The step-by-step pipeline is highly systematic, enabling large-scale query data generation with strong operability and reproducibility. The clear phases—from data curation and design logic extraction to question synthesis—provide a structured framework that can be easily adapted or extended by other researchers. This practicality enhances the method's value for real-world applications.
1) The work resembles a complex engineering effort with multiple sub-tasks (e.g., discipline labeling, difficulty classification) pieced together, which may lack novelty.
2) The pipeline relies heavily on prompt engineering (PE) at multiple stages (e.g., discipline classification using Figure 7, difficulty classification using Figure 8), but lacks rigorous quality assessments for each step. For instance, the discipline labels and difficulty scores are derived from LLM judgments without vali
Taxonomy
Topics: Natural Language Processing Techniques
