Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li

TL;DR
Sigma is a specialized large language model for the system domain that employs DiffQKV attention to improve inference efficiency by differentially compressing components, achieving up to 33.36% speedup and outperforming GPT-4 on domain-specific tasks.
Contribution
The paper introduces DiffQKV attention, a novel architecture that enhances efficiency by differentially compressing Q, K, and V components, and pre-trains Sigma on extensive system domain data.
Findings
DiffQKV attention improves inference speed by up to 33.36%.
Sigma outperforms GPT-4 on system domain benchmarks.
The model's sensitivity to compression differs between the K and V components, which motivates compressing them by different amounts.
Abstract
We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that…
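The idea of compressing K and V by different amounts can be made concrete with a small back-of-the-envelope sketch. It is not the paper's implementation: it assumes, purely for illustration, that "compression" means keeping fewer K heads than V heads and sharing them across query heads by grouping. The class `AttnConfig`, the helper names, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    # All field names and values are hypothetical, chosen for illustration only.
    n_q_heads: int   # number of query heads
    n_k_heads: int   # number of key heads kept in the cache
    n_v_heads: int   # number of value heads kept in the cache
    k_head_dim: int  # per-head key dimension
    v_head_dim: int  # per-head value dimension


def kv_cache_elements(cfg: AttnConfig, seq_len: int, n_layers: int) -> int:
    """Count the K and V scalars cached for one sequence."""
    per_token = cfg.n_k_heads * cfg.k_head_dim + cfg.n_v_heads * cfg.v_head_dim
    return per_token * seq_len * n_layers


def shared_head_index(q_head: int, n_q_heads: int, n_shared_heads: int) -> int:
    """Map a query head to the K (or V) head it reads, using contiguous groups."""
    if n_q_heads % n_shared_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_shared_heads")
    group_size = n_q_heads // n_shared_heads
    return q_head // group_size


# Baseline: one K head and one V head per query head.
baseline = AttnConfig(n_q_heads=32, n_k_heads=32, n_v_heads=32,
                      k_head_dim=128, v_head_dim=128)
# Assumed differential setting: K is compressed more aggressively than V.
diff_kv = AttnConfig(n_q_heads=32, n_k_heads=4, n_v_heads=16,
                     k_head_dim=128, v_head_dim=128)

base_size = kv_cache_elements(baseline, seq_len=4096, n_layers=24)
diff_size = kv_cache_elements(diff_kv, seq_len=4096, n_layers=24)
ratio = diff_size / base_size  # fraction of the baseline cache that remains
```

Under these assumed numbers, each query head would look up its key with `shared_head_index(q, 32, 4)` and its value with `shared_head_index(q, 32, 16)`, so the two lookups use different group sizes. The cache ratio depends only on the head counts and head dimensions, not on sequence length or depth.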
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Feedforward Network · Dropout · Byte Pair Encoding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
