Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model
Shen Li, Renfen Hu, Lijun Wang

TL;DR
This paper presents AI Taiyan, a 1.8-billion-parameter large language model built for Classical Chinese, which outperforms general-purpose models and traditional methods on domain-specific tasks.
Contribution
The paper introduces a practical approach to building effective domain-specific large language models from scratch with limited parameters and data, demonstrated on Classical Chinese.
Findings
AI Taiyan outperforms general models on Classical Chinese tasks
It achieves near-human or above-human performance on key language processing tasks
It provides a scalable framework for domain-specific LLM development
Abstract
General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many natural language processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to language processing of Classical…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
