MEMORY-VQ: Compression for Tractable Internet-Scale Memory

Yury Zemlyanskiy; Michiel de Jong; Luke Vilnis; Santiago Ontañón; William W. Cohen; Sumit Sanghai; Joshua Ainslie

arXiv:2308.14903 · cs.CL · August 30, 2023

TL;DR

MEMORY-VQ introduces a vector quantization approach to compress token representations in memory-augmented language models, significantly reducing storage needs while maintaining performance, thus enabling scalable retrieval augmentation.

Contribution

The paper presents MEMORY-VQ, a compression method that uses a vector quantization variational autoencoder (VQ-VAE) to reduce the storage requirements of memory-augmented retrieval models while keeping performance comparable.

Findings

1. Achieves 16x compression rate with LUMEN-VQ.

2. Maintains comparable performance on KILT benchmark.

3. Enables practical retrieval augmentation for large corpora.

Abstract

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.
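
To make the storage argument concrete, here is a minimal NumPy sketch of codebook-based compression of token representations. It is an illustration under stated assumptions, not the paper's implementation: the product-quantization layout, the sizes (64-dimensional float32 vectors, 16 sub-vectors, 256 codes per codebook), the random codebooks, and the helper names `quantize` and `dequantize` are all hypothetical. A VQ-VAE would learn the codebooks during training rather than sample them.

```python
import numpy as np

def quantize(vectors, codebooks):
    """Replace each sub-vector with the index of its nearest codebook entry."""
    n_sub, n_codes, sub_dim = codebooks.shape
    parts = vectors.reshape(len(vectors), n_sub, sub_dim)
    # Squared distance from every sub-vector to every code: (n, n_sub, n_codes)
    dists = ((parts[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    # One byte per sub-vector is enough for up to 256 codes
    return dists.argmin(-1).astype(np.uint8)

def dequantize(codes, codebooks):
    """Look the stored indices back up to get approximate vectors."""
    n_sub = codebooks.shape[0]
    parts = codebooks[np.arange(n_sub)[None, :], codes]  # (n, n_sub, sub_dim)
    return parts.reshape(len(codes), -1)

# Hypothetical sizes, chosen only so the arithmetic is easy to follow
rng = np.random.default_rng(0)
dim, n_sub, n_codes = 64, 16, 256
vectors = rng.standard_normal((200, dim)).astype(np.float32)

# Assumption: random codebooks stand in for learned ones
codebooks = rng.standard_normal((n_sub, n_codes, dim // n_sub)).astype(np.float32)

codes = quantize(vectors, codebooks)      # what would be stored in memory
restored = dequantize(codes, codebooks)   # what the model would read back

# 64 float32 values (256 bytes) versus 16 one-byte codes per token
ratio = vectors.nbytes / codes.nbytes
print(ratio)
```

The ratio here ignores the codebooks themselves, which are a fixed cost shared by every stored token and become negligible for a large corpus. With random codebooks the reconstruction is poor; the point of learning them is to keep `restored` close enough to `vectors` that downstream quality stays comparable.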

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings