GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Daniel Goldstein; Fares Obeid; Eric Alcaide; Guangyu Song; Eugene Cheah

arXiv:2407.12077 · cs.CL · July 18, 2024 · 2 cites

TL;DR

GoldFinch introduces a hybrid linear attention/transformer model with a novel highly compressed and reusable KV-Cache, enabling efficient large-context inference with significantly reduced memory requirements and fast pre-fill computation.

Contribution

The paper presents GoldFinch, a new hybrid model with a novel cache compression technique that drastically reduces memory usage and improves efficiency over prior transformer architectures.

Findings

01

For common model sizes, the GoldFinch cache is 756 to 2550 times smaller than a traditional transformer KV-Cache, and the savings grow linearly with model layer count.

02

In training runs up to the 1.5B parameter class, GoldFinch achieves better modeling performance than both Finch and Llama.

03

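Pre-fill computation of the initial cache costs only O(1) time per token because an RNN generates the cache, although autoregressive generation still costs O(n) time per token because of attention.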
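
The difference between the two per-token costs in the third finding can be illustrated with a toy operation count. This is only a cost model under stated assumptions, not the GoldFinch implementation: the function names and the unit `state_update_cost` are hypothetical, and one "unit" stands for whatever constant work a fixed-size recurrent state update or a single attention lookup requires.

```python
def recurrent_prefill_costs(n_tokens, state_update_cost=1):
    """Per-step cost when each token updates a fixed-size recurrent state.

    The cost does not depend on the token position, i.e. O(1) per token.
    (Hypothetical cost model, not taken from the paper.)
    """
    return [state_update_cost] * n_tokens


def attention_decode_costs(n_tokens):
    """Per-step cost when token t attends to all t cached entries.

    The cost grows with the position, i.e. O(n) per token.
    """
    return list(range(1, n_tokens + 1))


prefill = recurrent_prefill_costs(8)
decode = attention_decode_costs(8)

print("recurrent pre-fill, cost per token:", prefill)
print("attention decode, cost per token:  ", decode)
print("totals:", sum(prefill), "vs", sum(decode))
```

Under this model the total pre-fill cost is linear in the sequence length, while the total cost of attending at every step is quadratic.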

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state…
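
To give a sense of scale for the reported cache savings, the following sketch computes the size of a standard transformer KV-Cache and divides it by the 756x and 2550x ratios quoted in the abstract. The model configuration (24 layers, 16 key/value heads, head dimension 128, 16-bit elements, 131072-token context) is a hypothetical example and is not taken from the paper. The abstract ties the ratio to layer count, so applying both ends of the range to a single configuration is purely illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a standard transformer KV-Cache.

    Each layer stores one key vector and one value vector per token,
    hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def compressed_range_bytes(baseline_bytes, low_ratio=756, high_ratio=2550):
    """Cache sizes implied by the ratio range reported in the abstract.

    Returns (size at the low ratio, size at the high ratio).
    """
    return baseline_bytes / low_ratio, baseline_bytes / high_ratio


# Hypothetical configuration, chosen only for illustration.
baseline = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=128, seq_len=131072)
at_low_ratio, at_high_ratio = compressed_range_bytes(baseline)

GIB = 2**30
MIB = 2**20
print(f"standard KV-Cache:   {baseline / GIB:.1f} GiB")
print(f"at  756x compression: {at_low_ratio / MIB:.1f} MiB")
print(f"at 2550x compression: {at_high_ratio / MIB:.1f} MiB")
```

Because the standard cache grows with `n_layers` while the savings are reported to grow linearly with layer count as well, deeper models would sit nearer the high end of the ratio range.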

Code & Models

Models

recursal/GoldFinch-paper (Hugging Face model)

