Continuous Autoregressive Language Models

Chenze Shao; Darren Li; Fandong Meng; Jie Zhou

arXiv:2510.27688·cs.CL·November 3, 2025

TL;DR

CALM introduces continuous vector prediction for language modeling, reducing generative steps and computational cost while maintaining high accuracy, representing a scalable new approach for efficient large language models.

Contribution

This work pioneers the shift from discrete token prediction to continuous vector prediction in language models, enabling more efficient and scalable LLMs.

Findings

01 Achieves over 99.9% token reconstruction accuracy

02 Reduces generative steps by a factor of K

03 Improves performance-compute trade-off significantly

Abstract

The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free…
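The step reduction described in the abstract can be illustrated with a small sketch. It only shows the chunking arithmetic, not the autoencoder or the model itself. The helper names `chunk_tokens` and `generative_steps`, the padding of the final chunk, and the `pad_id` value are assumptions made for illustration and are not taken from the paper.

```python
import math


def chunk_tokens(token_ids, k, pad_id=0):
    """Group token ids into consecutive chunks of k tokens.

    Padding the last chunk with pad_id is an assumption of this sketch;
    the abstract does not say how a trailing partial chunk is handled.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    chunks = []
    for start in range(0, len(token_ids), k):
        chunk = list(token_ids[start:start + k])
        chunk += [pad_id] * (k - len(chunk))  # pad a short final chunk
        chunks.append(chunk)
    return chunks


def generative_steps(num_tokens, k):
    """Number of autoregressive steps if each step emits one k-token chunk."""
    return math.ceil(num_tokens / k)


tokens = list(range(1, 11))          # 10 hypothetical token ids
chunks = chunk_tokens(tokens, k=4)   # each chunk would map to one continuous vector
steps_token_level = generative_steps(len(tokens), k=1)
steps_chunk_level = generative_steps(len(tokens), k=4)
```

With 10 tokens and K = 4, token-by-token generation takes 10 steps, while chunk-level generation takes 3, which is the factor-of-K reduction up to rounding.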

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The method goes beyond standard next-token prediction and allows for hierarchical modeling of language.
- The proposed methodology for training and evaluation is sound.

Weaknesses

See questions.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- CALM reduces both training and inference FLOPs compared to a vanilla Transformer.
- The generation quality also seems to be better (according to the Brier score).

Weaknesses

- My main concern is the limited evaluation:
  - How does CALM perform on standard LM evals such as HellaSwag, PIQA, etc.?
  - More importantly, I'm curious about the in-context recall abilities of CALM. It has been observed that many efficient architectures match the vanilla Transformer in perplexity, simple LM evals, etc., but lack the ability to recall specific tokens from the past. How does CALM perform on tasks from the EVAPORATE suite: https://huggingface.co/collections/hazyresearch/evaporate-su

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. A new way of looking at LLM training.
2. A thorough process for defining new loss functions, metrics, and a decoding mechanism for the proposed LM.

Weaknesses

1. Comparisons with existing multi-token prediction methods
2. Accuracy on generative benchmarks
3. Although not explicitly, some papers, such as (1) Medusa, https://arxiv.org/abs/2401.10774, and (2) Your LLM Knows the Future, https://arxiv.org/pdf/2507.11851, encode this set-of-tokens behavior in the model embedding in its current form, so a section explaining this correlation would be useful.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods