Continuous Autoregressive Language Models
Chenze Shao, Darren Li, Fandong Meng, Jie Zhou

TL;DR
CALM replaces discrete next-token prediction with continuous next-vector prediction, cutting the number of generative steps by a factor of K and lowering computational cost while still reconstructing the original tokens with over 99.9% accuracy.
Contribution
This work pioneers the shift from discrete token prediction to continuous vector prediction in language models, enabling more efficient and scalable LLMs.
Findings
Achieves over 99.9% token reconstruction accuracy
Reduces generative steps by a factor of K, where K is the number of tokens compressed into each continuous vector
Improves performance-compute trade-off significantly
Abstract
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free…
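The step-count arithmetic in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the padding token `pad_id`, and the choice to pad the final partial chunk are all assumptions made for illustration, and the autoencoder that maps each chunk to a continuous vector is omitted entirely.

```python
from math import ceil


def chunk_tokens(token_ids, k, pad_id=0):
    """Split a token sequence into consecutive chunks of k tokens.

    Padding the last chunk with pad_id is an assumption of this sketch;
    the source does not say how partial chunks are handled.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    token_ids = list(token_ids)
    # Number of pad tokens needed to reach a multiple of k.
    padded = token_ids + [pad_id] * (-len(token_ids) % k)
    return [padded[i:i + k] for i in range(0, len(padded), k)]


def generative_steps(num_tokens, k):
    """Autoregressive steps needed when each step emits one chunk of k tokens."""
    return ceil(num_tokens / k)


tokens = list(range(1, 11))          # 10 hypothetical token ids
chunks = chunk_tokens(tokens, k=4)   # each chunk would be encoded as one vector

print(chunks)
print(generative_steps(len(tokens), 1), "steps token-by-token")
print(generative_steps(len(tokens), 4), "steps with K = 4")
```

With K = 1 the step count equals the sequence length, which corresponds to ordinary next-token prediction; larger K reduces the step count roughly in proportion.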
Peer Reviews
Decision · Submitted to ICLR 2026
- The method goes beyond standard next-token prediction and allows for hierarchical modeling of language.
- The proposed methodology for training and evaluation is sound.
See questions.
- CALM reduces both training and inference FLOPs compared to a vanilla transformer.
- The generation quality also seems to be better (according to Brier Score).
- My main concern is the limited evaluation:
  - How does CALM perform on standard LM evals like HellaSwag, PIQA, etc.?
  - More importantly, I'm curious about the in-context recall abilities of CALM. It has been observed that many efficient architectures match the vanilla transformer in perplexity, simple LM evals, etc., but lack the ability to recall specific tokens from the past. How does CALM perform on tasks from the EVAPORATE suite: https://huggingface.co/collections/hazyresearch/evaporate-su
1. A new vision of looking into LLM training.
2. A thorough process to define new loss functions, metrics, and a decoding mechanism for the proposed LM.
1. Comparisons with existing multi-token prediction methods.
2. Accuracy on generative benchmarks.
3. Although they do not do so explicitly, some papers, such as Medusa (https://arxiv.org/abs/2401.10774) and "Your LLM Knows the Future" (https://arxiv.org/pdf/2507.11851), encode this set-of-tokens behavior in the model embedding in its current form. A section explaining this correlation would be useful.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
