MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models

Yangyan Li

arXiv:2601.22887·cs.LG·February 2, 2026

Open Access

TL;DR

MoVE introduces a novel mechanism that decouples model capacity from computational cost in autoregressive models by using a shared bank of learnable value embeddings, enabling scalable parametric memory.

Contribution

The paper presents MoVE, a new architecture that independently scales parametric memory in autoregressive models through a shared embedding bank and differentiable gating.

Findings

1. MoVE improves performance in text and image generation tasks.

2. MoVE achieves lower perplexity and higher fidelity at similar compute budgets.

3. MoVE enables scalable memory without increasing FLOPs.

Abstract

Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model's parametric memory -- its repository of factual knowledge or visual patterns -- traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to…
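
To make the two ingredients named in the abstract concrete (a bank of value embeddings shared across layers, and a differentiable soft gate evaluated per token), here is a minimal sketch in plain Python. The abstract is cut off before it says what the gate output is used for, so everything past the softmax is an assumption: the sketch simply forms a gate-weighted average of the bank rows. The names `mix_value_embeddings`, `gate_weights`, and `demo`, along with all shapes and the random initialization, are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mix_value_embeddings(hidden, gate_weights, bank):
    """Soft-gate over a shared bank; return (gates, mixed embedding).

    hidden:       token hidden state, length d
    gate_weights: hypothetical per-layer gate projection, shape (M, d)
    bank:         shared value-embedding bank, shape (M, d_v)

    ASSUMPTION: the gate-weighted average below is illustrative only;
    the source text is truncated before describing this step.
    """
    gates = softmax(matvec(gate_weights, hidden))
    d_v = len(bank[0])
    mixed = [sum(g * row[j] for g, row in zip(gates, bank)) for j in range(d_v)]
    return gates, mixed


def demo(seed=0, d=4, d_v=3, bank_size=5, num_layers=2):
    """Build one bank, reuse it in every layer, and gate one token per layer."""
    rng = random.Random(seed)
    # A single bank object is shared by all layers.
    bank = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(bank_size)]
    # Each layer gets its own (hypothetical) gate projection.
    layer_gates = [
        [[rng.gauss(0, 1) for _ in range(d)] for _ in range(bank_size)]
        for _ in range(num_layers)
    ]
    hidden = [rng.gauss(0, 1) for _ in range(d)]
    return bank, [mix_value_embeddings(hidden, g, bank) for g in layer_gates]


if __name__ == "__main__":
    bank, outputs = demo()
    for layer, (gates, mixed) in enumerate(outputs):
        print(layer, [round(g, 3) for g in gates], [round(v, 3) for v in mixed])
```

Because the gates come from a softmax, they are non-negative and sum to one, so every operation in the path is differentiable and the mixed vector is a convex combination of bank rows. The sketch does not attempt to reproduce the paper's compute characteristics or how the result enters attention.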

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Topic Modeling