Chain-of-Model Learning for Language Model

Kaitao Song; Xiaohua Wang; Xu Tan; Huiqiang Jiang; Chengruidong Zhang; Yongliang Shen; Cen LU; Zihao Li; Zifan Song; Caihua Shan; Yansen Wang; Kan Ren; Xiaoqing Zheng; Tao Qin; Yuqing Yang; Dongsheng Li; Lili Qiu

arXiv:2505.11820 · cs.CL · May 26, 2025

TL;DR

This paper introduces Chain-of-Model (CoM), a new learning paradigm for language models that enhances scaling efficiency and inference flexibility by structuring hidden states as chains of sub-representations, enabling progressive scaling and elastic inference.

Contribution

The paper proposes the CoM framework and the CoLM model, integrating chain-based representations into Transformer layers for scalable, flexible language modeling with shared key-value mechanisms.

Findings

01 Achieves comparable performance to standard Transformers.

02 Enables progressive scaling for training efficiency.

03 Supports elastic inference with multiple model sizes.

Abstract

In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer in a chain style, thereby introducing significant scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain of the output representations can only view its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up its size by adding chains on top of the previous models (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this…
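
The chain constraint described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single linear layer whose hidden dimension is split into chains, and it reads "can only view its preceding chains" as a block lower-triangular mask in which output chain i sees input chains 0 through i (whether a chain also sees its own index is an assumption here). The names `chain_mask`, `chain_linear`, and the chain sizes are hypothetical.

```python
import numpy as np


def chain_mask(chain_sizes_in, chain_sizes_out):
    """Block lower-triangular mask (assumed reading of the chain rule):
    output chain i may read input chains 0..i and nothing after them."""
    mask = np.zeros((sum(chain_sizes_out), sum(chain_sizes_in)))
    row = 0
    for i, out_size in enumerate(chain_sizes_out):
        visible = sum(chain_sizes_in[: i + 1])  # width of input chains 0..i
        mask[row : row + out_size, :visible] = 1.0
        row += out_size
    return mask


def chain_linear(x, weight, mask):
    """Linear map whose weights are zeroed wherever the mask forbids a link."""
    return x @ (weight * mask).T


rng = np.random.default_rng(0)
chains = [2, 2, 4]                      # hypothetical chain sizes, total width 8
weight = rng.normal(size=(8, 8))
mask = chain_mask(chains, chains)
x = rng.normal(size=(3, 8))             # batch of 3 hidden states

full = chain_linear(x, weight, mask)    # all three chains active

# Elastic inference, sketched: keep only the first two chains.
k = sum(chains[:2])
sub = x[:, :k] @ (weight[:k, :k] * mask[:k, :k]).T

# The sub-model reproduces the first k outputs of the full model,
# because those outputs never depended on the later chains.
prefix_matches = np.allclose(full[:, :k], sub)
```

Under these assumptions, dropping later chains leaves the outputs of earlier chains unchanged, which is the property that would let one set of weights serve several sub-model sizes, and that would let new chains be added without disturbing the existing ones.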

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Softmax · Position-Wise Feed-Forward Layer