Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

Zhenghao Lin; Zihao Tang; Xiao Liu; Yeyun Gong; Yi Cheng; Qi Chen; Hang Li; Ying Xin; Ziyue Yang; Kailai Yang; Yu Yan; Xiao Liang; Shuai Lu; Yiming Huang; Zheheng Luo; Lei Qu; Xuan Feng; Yaoxiang Wang; Yuqing Xia; Feiyang Chen; Yuting Jiang; Yasen Hu; Hao Ni; Binyang Li; Guoshuai Zhao; Jui-Hao Chiang; Zhongxin Guo; Chen Lin; Kun Kuang; Wenjie Li; Yelong Shen; Jian Jiao; Peng Cheng; Mao Yang

arXiv:2501.13629·cs.CL·February 11, 2025

Open Access

TL;DR

Sigma is a large language model specialized for the system domain. Its DiffQKV attention compresses the K and V components differentially and expands the Q head dimension, improving inference speed by up to 33.36%, and the model outperforms GPT-4 on system domain tasks.

Contribution

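The paper introduces DiffQKV attention, an architecture that improves inference efficiency by treating the Q, K, and V components differently: K and V are compressed to different degrees, while the Q head dimension is expanded (augmented Q). It also pre-trains Sigma on extensive system domain data.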

Findings

01. DiffQKV attention improves inference speed by up to 33.36%.

02. Sigma outperforms GPT-4 on system domain benchmarks.

03. The model's sensitivity to compression differs between the K and V components, which motivates compressing them differentially.

Abstract

We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that…
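
To make the idea of differentially compressed KV concrete, here is a rough Python sketch of how the KV cache size changes when K uses fewer heads than V. It is only an illustration under stated assumptions: the `AttnConfig` class, the function names, and all head counts, dimensions, and layer counts are hypothetical and are not Sigma's actual configuration. The head-sharing rule is the usual grouped-query mapping, assumed here for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical per-layer attention shape; field names are illustrative."""
    n_q_heads: int
    n_k_heads: int
    n_v_heads: int
    k_head_dim: int
    v_head_dim: int


def kv_cache_bytes(cfg, n_layers, seq_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence of `seq_len` tokens."""
    k_elems = cfg.n_k_heads * cfg.k_head_dim  # cached K values per token per layer
    v_elems = cfg.n_v_heads * cfg.v_head_dim  # cached V values per token per layer
    return n_layers * seq_len * (k_elems + v_elems) * bytes_per_elem


def shared_head(q_head, n_q_heads, n_kv_heads):
    """Index of the K (or V) head that query head `q_head` reads, GQA-style."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Illustrative numbers only (assumed), not taken from the paper.
gqa = AttnConfig(n_q_heads=16, n_k_heads=8, n_v_heads=8, k_head_dim=64, v_head_dim=64)
diff = AttnConfig(n_q_heads=16, n_k_heads=4, n_v_heads=8, k_head_dim=64, v_head_dim=64)

gqa_bytes = kv_cache_bytes(gqa, n_layers=24, seq_len=4096)
diff_bytes = kv_cache_bytes(diff, n_layers=24, seq_len=4096)
print(gqa_bytes, diff_bytes, diff_bytes / gqa_bytes)
```

In this toy setting, halving only the number of K heads shrinks the cache to three quarters of the baseline while leaving V untouched. Because K and V have different head counts, a given query head can map to different K and V head indices, for example `shared_head(5, 16, 4)` for K versus `shared_head(5, 16, 8)` for V.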

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Feedforward Network · Dropout · Byte Pair Encoding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings