A Generalizable Framework for Building Executable Domain-Specific LLMs under Data Scarcity: Demonstration on Semiconductor TCAD Simulation

Di Wang; Zhenhua Wu; Yu Liu; Kai Chang; Shaohua Wu

arXiv:2601.10128 · cs.CE · January 16, 2026

Open Access

TL;DR

This paper introduces a schema-first alignment framework for building compact, executable domain-specific large language models in low-resource settings. Demonstrated on semiconductor TCAD simulation, the resulting model, TcadGPT, reaches 85.6% semantic accuracy and an 80.0% syntax pass rate, outperforming general LLMs.

Contribution

The paper presents a framework that combines large-scale synthetic QA data generation, a code-centric IR->DPO workflow, and a controlled evaluation of RAG to build executable domain-specific LLMs from limited data.

Findings

01. TcadGPT achieves 85.6% semantic accuracy.

02. TcadGPT attains an 80.0% syntax pass rate on executability tests.

03. The framework also improves script success rates in other domains, such as Elmer.

Abstract

Scientific and engineering verticals often suffer from data scarcity and strict executability requirements: models must generate not only fluent text, but also syntactically valid, tool-compilable scripts. We present a schema-first alignment framework for building compact, executable domain-specific LLMs in low-resource settings. The framework integrates three core components: (i) large-scale synthetic QA data generation from expert documentation to instill foundational domain knowledge; (ii) a code-centric IR->DPO workflow that converts verified tool decks into interpretable intermediate representations (IR), performs equivalence-preserving diversification, and constructs preference pairs to directly optimize instruction compliance and code executability; and (iii) a controlled evaluation of Retrieval-Augmented Generation (RAG), showing that while RAG benefits general LLMs, it can…
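
The IR->DPO step named in the abstract (deck to intermediate representation, equivalence-preserving diversification, preference-pair construction) can be illustrated with a minimal sketch. Nothing here is taken from the paper's implementation: the `Statement` IR, the one-line deck syntax, the parameter-shuffling diversification, and the drop-a-parameter corruption are all hypothetical stand-ins. It also assumes that parameter order within a statement does not change meaning. The output uses the prompt/chosen/rejected record layout that is common for DPO training data.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One command in a hypothetical deck IR: a keyword plus (name, value) pairs."""
    command: str
    params: tuple


def render(stmt):
    """Serialize an IR statement to a made-up one-line deck syntax."""
    args = " ".join(f"{name}={value}" for name, value in stmt.params)
    return f"{stmt.command} {args}".strip()


def canonical(deck):
    """Order-insensitive form, used to check that a variant is equivalent."""
    return [(s.command, frozenset(s.params)) for s in deck]


def diversify(deck, rng):
    """Equivalence-preserving variant: shuffle parameter order per statement.

    Assumption: parameter order is not significant in the target tool.
    """
    variant = []
    for s in deck:
        params = list(s.params)
        rng.shuffle(params)
        variant.append(Statement(s.command, tuple(params)))
    assert canonical(variant) == canonical(deck)
    return variant


def corrupt(deck, rng):
    """Build a 'rejected' deck by dropping one parameter from one statement."""
    candidates = [i for i, s in enumerate(deck) if s.params]
    idx = rng.choice(candidates)
    broken = list(deck)
    broken[idx] = Statement(deck[idx].command, deck[idx].params[1:])
    return broken


def make_preference_pair(prompt, deck, rng):
    """Pair an equivalent variant (chosen) with a corrupted deck (rejected)."""
    chosen = "\n".join(render(s) for s in diversify(deck, rng))
    rejected = "\n".join(render(s) for s in corrupt(deck, rng))
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Hypothetical verified deck, already parsed into IR.
deck = [
    Statement("mesh", (("spacing", 0.01), ("width", 1.0))),
    Statement("doping", (("species", "boron"), ("conc", 1e17))),
]
pair = make_preference_pair("Define a mesh and a boron doping profile.", deck, random.Random(0))
```

In a real pipeline the rejected side would more plausibly come from decks that fail the tool's own syntax check, and the equivalence test would be defined by the tool's semantics rather than by set equality of parameters.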

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Machine Learning in Materials Science