HGRN2: Gated Linear RNNs with State Expansion

Zhen Qin; Songlin Yang; Weixuan Sun; Xuyang Shen; Dong Li; Weigao Sun; Yiran Zhong

arXiv:2404.07904 · cs.CL · August 20, 2024 · 1 cites

TL;DR

HGRN2 adds an outer product-based state expansion to the gated linear RNN HGRN. The expansion enlarges the recurrent state without extra parameters, permits hardware-efficient training, and improves language modeling performance over HGRN.

Contribution

The paper proposes an outer product-based state expansion mechanism for HGRN that enlarges its recurrent state, adds no parameters, and gives the model a linear attention interpretation.

Findings

01 HGRN2 consistently outperforms the original HGRN across the settings tested.

02 The state expansion gives HGRN2 a linear attention interpretation, which enables hardware-efficient training.

03 HGRN2 is competitive with other recurrent models.

Abstract

Hierarchically gated linear RNN (HGRN; Qin et al., 2023) has demonstrated competitive training speed and performance in language modeling while offering efficient inference. However, the recurrent state size of HGRN remains relatively small, limiting its expressiveness. To address this issue, we introduce a simple outer product-based state expansion mechanism, which significantly enlarges the recurrent state size without introducing any additional parameters. This enhancement also provides a linear attention interpretation for HGRN2, enabling hardware-efficient training. Our extensive experiments verify that HGRN2 consistently outperforms HGRN across different settings and is competitive with other recurrent models.
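
The abstract does not spell out the recurrence, so the following is only a minimal sketch under stated assumptions. It assumes an element-wise gated update h_t = f_t * h_{t-1} + (1 - f_t) * x_t for the small-state model and an outer-product update H_t = Diag(f_t) H_{t-1} + outer(1 - f_t, x_t) for the expanded one, with a readout y_t = H_t^T q_t. The names `fs`, `xs`, `qs` and all function names are hypothetical, and the random inputs stand in for learned projections. The sketch illustrates two points from the abstract: the state grows from d to d x d entries while the same gate and input vectors are reused, and the expanded recurrence can be rewritten as causal linear attention with a per-channel decay.

```python
import math
import random


def elementwise_step(h, f, x):
    # Assumed small-state update: the state holds d numbers.
    return [fj * hj + (1.0 - fj) * xj for hj, fj, xj in zip(h, f, x)]


def expanded_step(H, f, x):
    # Assumed outer-product update: the state is a d x d matrix.
    # Row j decays by f[j], then outer(1 - f, x) is added.
    return [[f[j] * H[j][k] + (1.0 - f[j]) * x[k] for k in range(len(x))]
            for j in range(len(f))]


def recurrent_forward(fs, xs, qs):
    # Run the expanded recurrence and read out y_t = H_t^T q_t.
    d = len(fs[0])
    H = [[0.0] * d for _ in range(d)]
    ys = []
    for f, x, q in zip(fs, xs, qs):
        H = expanded_step(H, f, x)
        ys.append([sum(q[j] * H[j][k] for j in range(d)) for k in range(d)])
    return ys, H


def attention_forward(fs, xs, qs):
    # Same outputs written as causal linear attention with decay:
    # key k_s = 1 - f_s, value v_s = x_s, query q_t.
    d = len(fs[0])
    ys = []
    for t in range(len(fs)):
        y = [0.0] * d
        for s in range(t + 1):
            decay = [1.0] * d  # product of gates f_{s+1} .. f_t per channel
            for r in range(s + 1, t + 1):
                decay = [a * b for a, b in zip(decay, fs[r])]
            score = sum(qj * dj * (1.0 - fj)
                        for qj, dj, fj in zip(qs[t], decay, fs[s]))
            y = [yk + score * vk for yk, vk in zip(y, xs[s])]
        ys.append(y)
    return ys


random.seed(0)
T, d = 6, 4
# Random stand-ins for learned gate, input, and query projections.
fs = [[1.0 / (1.0 + math.exp(-random.gauss(0, 1))) for _ in range(d)]
      for _ in range(T)]
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
qs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

h_small = [0.0] * d
for f, x in zip(fs, xs):
    h_small = elementwise_step(h_small, f, x)

ys_rec, H_final = recurrent_forward(fs, xs, qs)
ys_att = attention_forward(fs, xs, qs)
max_gap = max(abs(a - b)
              for ya, yb in zip(ys_rec, ys_att) for a, b in zip(ya, yb))
print("state entries: element-wise =", len(h_small),
      "expanded =", sum(len(row) for row in H_final))
print("max |recurrent - attention| =", max_gap)
```

Under these assumptions the diagonal of the expanded state reproduces the element-wise state, so the small-state model is contained in the expanded one. The quadratic-time attention form here is written for clarity only; it is not the hardware-efficient training algorithm the abstract refers to.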

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Label Smoothing · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Multi-Head Attention · Adam · Dropout