A Generalizable Framework for Building Executable Domain-Specific LLMs under Data Scarcity: Demonstration on Semiconductor TCAD Simulation
Di Wang, Zhenhua Wu, Yu Liu, Kai Chang, Shaohua Wu

TL;DR
This paper introduces a schema-first alignment framework for building compact, executable domain-specific large language models in low-resource settings. Demonstrated on semiconductor TCAD simulation, the resulting model, TcadGPT, reaches 85.6% semantic accuracy and an 80.0% syntax pass rate, outperforming general-purpose LLMs.
Contribution
The paper presents a framework that combines synthetic QA data generation, a code-centric IR->DPO workflow, and a controlled evaluation of RAG to build executable domain-specific LLMs from limited data.
Findings
TcadGPT achieves 85.6% semantic accuracy.
TcadGPT attains an 80.0% syntax pass rate on executability tests.
The framework also improves script success rates in other domains, such as Elmer.
Abstract
Scientific and engineering verticals often suffer from data scarcity and strict executability requirements: models must generate not only fluent text, but also syntactically valid, tool-compilable scripts. We present a schema-first alignment framework for building compact, executable domain-specific LLMs in low-resource settings. The framework integrates three core components: (i) large-scale synthetic QA data generation from expert documentation to instill foundational domain knowledge; (ii) a code-centric IR->DPO workflow that converts verified tool decks into interpretable intermediate representations (IR), performs equivalence-preserving diversification, and constructs preference pairs to directly optimize instruction compliance and code executability; and (iii) a controlled evaluation of Retrieval-Augmented Generation (RAG), showing that while RAG benefits general LLMs, it can…
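The IR->DPO workflow in component (ii) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the one-line `command key=value` deck syntax is a toy stand-in for real TCAD input, the names `Statement`, `parse_deck`, `diversify`, `corrupt`, and `make_preference_pair` are hypothetical, parameter reordering is assumed to be meaning-preserving in this toy syntax, and dropping a parameter is just one possible way to produce a rejected sample.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One IR node: a command plus its (key, value) parameters."""
    command: str
    params: tuple


def parse_deck(text):
    """Parse toy deck text into IR, one Statement per non-empty line."""
    statements = []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        params = tuple(tuple(tok.split("=", 1)) for tok in tokens[1:])
        statements.append(Statement(tokens[0], params))
    return statements


def render(statements):
    """Turn IR back into deck text."""
    return "\n".join(
        " ".join([s.command] + [f"{k}={v}" for k, v in s.params])
        for s in statements
    )


def canonical(statements):
    """Order-insensitive view of the IR, used to check equivalence."""
    return [(s.command, frozenset(s.params)) for s in statements]


def diversify(statements, rng):
    """Shuffle parameter order within each statement.

    Assumption: parameter order carries no meaning in this toy syntax.
    """
    out = []
    for s in statements:
        params = list(s.params)
        rng.shuffle(params)
        out.append(Statement(s.command, tuple(params)))
    return out


def corrupt(statements, rng):
    """Build a rejected sample by dropping one parameter from one statement."""
    candidates = [i for i, s in enumerate(statements) if s.params]
    i = rng.choice(candidates)
    s = statements[i]
    drop = rng.randrange(len(s.params))
    broken = Statement(s.command, s.params[:drop] + s.params[drop + 1:])
    return statements[:i] + [broken] + statements[i + 1:]


def make_preference_pair(instruction, deck_text, seed=0):
    """Return a DPO-style record with prompt, chosen, and rejected fields."""
    rng = random.Random(seed)
    ir = parse_deck(deck_text)
    variant = diversify(ir, rng)
    # Guard: diversification must not change the deck's meaning.
    assert canonical(variant) == canonical(ir)
    return {
        "prompt": instruction,
        "chosen": render(variant),
        "rejected": render(corrupt(variant, rng)),
    }


deck = "mesh nx=40 ny=20\ndope type=n conc=1e18\nsolve bias=0.5"
pair = make_preference_pair("Write a deck for a 40x20 mesh.", deck, seed=1)
```

In this sketch the chosen sample is a reordered but equivalent rendering of the verified deck, while the rejected sample differs from it by one missing parameter, so the preference signal targets compliance rather than surface form. A real pipeline would presumably validate both samples against the actual tool rather than a toy canonical form.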
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Machine Learning in Materials Science
