How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

Xin Lu; Yanyan Zhao; Si Wei; Shijin Wang; Bing Qin; Ting Liu

arXiv:2505.18522·cs.CL·October 27, 2025

Open Access

TL;DR

This paper investigates how the choice of sequence modeling architecture affects the base capabilities of pre-trained language models, and finds that architectures able to select arbitrarily from the full sequence preserve those capabilities better.

Contribution

The paper introduces a limited-domain pre-training setting with out-of-distribution testing, which separates architectures more clearly than mixed-domain pre-training does, and proposes a key design principle: support full-sequence arbitrary selection to avoid degrading base capabilities.

Findings

01. Stateful architectures show significant degradation in base capabilities.

02. Full-sequence arbitrary selection capability is crucial for maintaining base capabilities.

03. Simple Top-1 and chunk selection architectures validate the proposed design principle.
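
To make "full-sequence arbitrary selection" concrete, the following is a minimal sketch of a Top-1 style selection step, written under assumptions: this page does not describe the paper's actual Top-1 architecture, so the function name `top1_causal_select`, the scaled dot-product scoring, and the hard argmax are all illustrative choices, not the authors' method. The point it shows is only that each position may pick any single earlier position from the whole sequence, rather than reading from a compressed state.

```python
import numpy as np

def top1_causal_select(q, k, v):
    """Hypothetical Top-1 selection: each position copies the value of the
    single best-matching position among itself and all earlier positions.

    q, k: arrays of shape (seq_len, dim); v: array of shape (seq_len, dim_v).
    Returns the selected values and the chosen indices.
    """
    seq_len, dim = q.shape
    # Score every query against every key (scaled dot product, an assumption).
    scores = q @ k.T / np.sqrt(dim)
    # Causal mask: position i may only select positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Hard selection over the full visible prefix (illustration only;
    # a trainable model would need a differentiable treatment).
    chosen = scores.argmax(axis=1)
    return v[chosen], chosen

# Toy usage: keys are one-hot, so the selected index is easy to read off.
k = np.eye(4)
q = np.array([[1.0, 0, 0, 0],
              [1.0, 0, 0, 0],
              [0, 1.0, 0, 0],
              [0, 1.0, 0, 0]])
v = np.arange(8).reshape(4, 2)
out, chosen = top1_causal_select(q, k, v)
print(chosen)
```

In this toy setup the last position selects index 1, an arbitrary earlier position, which is the kind of direct lookup a fixed-size recurrent state cannot guarantee.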

Abstract

Pre-trained language models represented by the Transformer have been shown to possess strong base capabilities, and the Transformer's self-attention mechanism has become a classic sequence modeling architecture. Unlike work that proposes new sequence modeling architectures to improve the efficiency of attention, this work focuses on how sequence modeling architectures affect base capabilities. Specifically, we ask: how exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? We first point out that the mixed-domain pre-training setting commonly adopted in existing architecture design work fails to adequately reveal the differences in base capabilities among architectures. To address this, we propose a limited-domain pre-training setting with out-of-distribution…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Balanced Selection