MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models
Yangyan Li

TL;DR
MoVE introduces a novel mechanism that decouples model capacity from computational cost in autoregressive models by using a shared bank of learnable value embeddings, enabling scalable parametric memory.
Contribution
The paper presents MoVE, a new architecture that independently scales parametric memory in autoregressive models through a shared embedding bank and differentiable gating.
Findings
MoVE improves performance in text and image generation tasks.
MoVE achieves lower perplexity and higher fidelity at similar compute budgets.
MoVE enables scalable memory without increasing FLOPs.
Abstract
Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model's parametric memory -- its repository of factual knowledge or visual patterns -- traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to…
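The abstract is cut off before it says what the gating mechanism does with the bank, so the following is only a minimal sketch of one plausible reading: a softmax gate computed from the current token's hidden state mixes the rows of a shared embedding bank into a single value vector. The names `mix_value_embeddings`, `gate_weights`, `bank`, and `num_slots`, the linear gate, and the softmax normalization are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_value_embeddings(hidden, gate_weights, bank):
    """Softly mix a shared bank of value embeddings for one token.

    hidden:       list of d_model floats (the token's hidden state)
    gate_weights: num_slots rows of d_model floats (ASSUMED linear gate)
    bank:         num_slots rows of d_value floats (the shared bank)
    Returns (gates, mixed), where gates sums to 1 and mixed has d_value entries.
    """
    # One gating logit per bank slot: dot product of the gate row with the hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]
    gates = softmax(logits)
    # Convex combination of bank rows, weighted by the gates.
    d_value = len(bank[0])
    mixed = [sum(g * slot[j] for g, slot in zip(gates, bank)) for j in range(d_value)]
    return gates, mixed


# Toy sizes, chosen only for illustration.
random.seed(0)
d_model, d_value, num_slots = 4, 3, 5
bank = [[random.gauss(0, 1) for _ in range(d_value)] for _ in range(num_slots)]
gate_weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_slots)]
hidden = [random.gauss(0, 1) for _ in range(d_model)]

gates, mixed = mix_value_embeddings(hidden, gate_weights, bank)
print("gates:", [round(g, 3) for g in gates])
print("mixed value:", [round(v, 3) for v in mixed])
```

In this sketch the same `bank` would be passed to every attention layer, so enlarging `num_slots` adds parameters to a single shared table rather than to each layer. How the mixed vector enters attention, and how the paper keeps active FLOPs flat as the bank grows, are not stated in the truncated abstract and are not modeled here.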
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Topic Modeling
