SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

Jingyang Deng; Ran Chen; Jo-Ku Cheng; Jinwen Ma

arXiv:2505.04723 · cs.CL · May 9, 2025

Open Access

TL;DR

This paper presents SOAEsV2, a comprehensive pipeline for optimizing large language models for Chinese state-owned enterprises, combining continual pre-training, domain-progressive fine-tuning, and accelerated inference techniques.

Contribution

It introduces a full-pipeline framework that enhances domain-specific LLM performance while maintaining general capabilities and improving inference speed.

Findings

1. Maintains 99.8% of original language capabilities after domain pre-training.

2. Achieves over 1.08x improvement in ROUGE-1 score.

3. Speeds up inference by 1.39-1.52x with no quality loss.

Abstract

This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs a curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize…
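The curriculum idea behind domain-progressive SFT (training on weakly relevant data first, then on expert-annotated SOAEs data) can be illustrated with a minimal sketch. This is not the authors' implementation: `SFTStage`, its `relevance` score, `progressive_schedule`, `run_curriculum`, and the `train_fn` callback are all hypothetical names introduced here, and the sketch assumes each dataset can be given a single scalar relevance score.

```python
from dataclasses import dataclass


@dataclass
class SFTStage:
    name: str
    relevance: float  # hypothetical domain-relevance score in [0, 1]
    examples: list


def progressive_schedule(stages):
    """Order SFT stages from least to most domain-relevant."""
    return sorted(stages, key=lambda s: s.relevance)


def run_curriculum(stages, train_fn):
    """Call train_fn on each stage's examples in curriculum order.

    Returns the stage names in the order they were visited.
    """
    visited = []
    for stage in progressive_schedule(stages):
        train_fn(stage.examples)  # stand-in for one SFT pass
        visited.append(stage.name)
    return visited


# Toy data: the stage names and examples are placeholders, not real datasets.
stages = [
    SFTStage("expert_soaes", 1.0, ["expert-annotated example"]),
    SFTStage("general_chat", 0.2, ["weakly relevant dialogue"]),
]
seen = []
order = run_curriculum(stages, seen.extend)
```

Here `train_fn` only records the examples it receives; in practice it would be replaced by a real fine-tuning step, and the ordering logic would stay the same.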

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks

Methods: Shrink and Fine-Tune · Balanced Selection