Value Residual Learning

Zhanchao Zhou; Tianyi Wu; Zhiyun Jiang; Fares Obeid; Zhenzhong Lan

arXiv:2410.17897·cs.CL·June 10, 2025

TL;DR

This paper introduces ResFormer, a Transformer variant that adds value residual connections to improve information flow. It matches a standard Transformer's validation loss with fewer parameters and less training data, and its SVFormer variant cuts KV cache size by nearly half.

Contribution

ResFormer adds value residuals to improve information propagation in Transformers, reaching comparable performance with fewer parameters and less training data. The paper also introduces SVFormer, a variant that reduces KV cache size.

Findings

01

ResFormer achieves similar validation loss with 16.11% fewer parameters.

02

ResFormer reaches equivalent validation loss with 20.3% less training data than a standard Transformer.

03

SVFormer reduces KV cache size by nearly half.
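
The "nearly half" figure can be sanity-checked with simple counting. The sketch below assumes that a standard decoder caches one key tensor and one value tensor per layer, while SVFormer caches one key tensor per layer but only the first layer's value tensor. That accounting is an inference from the description of SVFormer as sharing the first layer's value across all layers; the function name and the example sizes (24 layers, 2048 tokens, width 1024) are hypothetical.

```python
def kv_cache_entries(num_layers: int, seq_len: int, d_model: int,
                     share_first_value: bool) -> int:
    """Count cached scalars for one sequence (illustrative accounting only)."""
    # Keys are cached for every layer in both settings.
    key_entries = num_layers * seq_len * d_model
    # Assumption: with value sharing, only the first layer's values are stored.
    value_layers = 1 if share_first_value else num_layers
    value_entries = value_layers * seq_len * d_model
    return key_entries + value_entries


# Hypothetical model sizes, chosen only to make the ratio concrete.
baseline = kv_cache_entries(24, 2048, 1024, share_first_value=False)
shared = kv_cache_entries(24, 2048, 1024, share_first_value=True)
ratio = shared / baseline  # equals (24 + 1) / (2 * 24)
```

Under this accounting the ratio is (L + 1) / (2L) for L layers, which approaches one half as the model gets deeper.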

Abstract

While Transformer models have achieved remarkable success in various domains, the effectiveness of information propagation through deep networks remains a critical challenge. Standard hidden state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections in addition to hidden state residuals. A variant, SVFormer, has all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 16.11% fewer model parameters and 20.3% less training data than a standard Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty and can be…
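
The value residual idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head causal attention, omits layer normalization and the feed-forward block, and assumes the value residual is a fixed weighted mix `lam * v_first + (1 - lam) * v` of the first layer's value and the current layer's value. The names `attention_layer`, `forward`, `lam`, and `share_value` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_layer(h, w_q, w_k, w_v, v_first=None, lam=0.5, share_value=False):
    """One single-head causal attention layer with an optional value residual."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    if v_first is None:
        v_first = v  # first layer: its own value becomes the shared reference
    if share_value:
        # SVFormer-style (assumed): every layer reuses the first layer's value.
        v_used = v_first
    else:
        # ResFormer-style (assumed mixing rule): blend first-layer and current value.
        v_used = lam * v_first + (1.0 - lam) * v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v_used
    # The usual hidden state residual is kept alongside the value residual.
    return h + out, v_first


def forward(h, layers, lam=0.5, share_value=False):
    """Run a stack of attention layers, threading the first layer's value through."""
    v_first = None
    for w_q, w_k, w_v in layers:
        h, v_first = attention_layer(h, w_q, w_k, w_v, v_first, lam, share_value)
    return h
```

With `lam=0.0` and `share_value=False` the stack reduces to plain attention layers with hidden state residuals only, which gives a baseline to compare against. With `share_value=True`, the per-layer value projection `w_v` is unused after the first layer, which is where the cache saving in SVFormer would come from under these assumptions.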

Code & Models

Repositories

Zcchill/Value-Residual-Learning
PyTorch · Official

Videos

Value Residual Learning · underline

Taxonomy

Topics: Neural Networks and Applications · Industrial Vision Systems and Defect Detection · Fault Detection and Control Systems

Methods: Attention Sinks · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout