Fast and Expressive Multi-Token Prediction with Probabilistic Circuits
Andreas Grivas, Lorenzo Loconte, Emile van Krieken, Piotr Nawrot, Yu Zhao, Euan Wielewski, Pasquale Minervini, Edoardo Ponti, Antonio Vergari

TL;DR
This paper introduces MTPC, a probabilistic circuit framework for multi-token prediction in large language models, balancing expressiveness and speed, and demonstrating significant acceleration without performance loss.
Contribution
The work presents a novel probabilistic circuit approach for multi-token prediction that generalizes existing models and improves generation speed in byte-level LLMs.
Findings
MTPC accelerates generation when combined with speculative decoding.
Retrofitting MTPC retains original LLM performance.
Exploring the trade-off between acceptance rate and latency across circuit architectures and degrees of layer sharing guides the choice of model configuration.
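To make the link between acceptance rate and speed-up concrete, here is a minimal sketch of the acceptance test used in standard speculative sampling. This page does not state which verification rule the paper uses, so the rule below is an assumption, and `count_accepted` and its parameters are hypothetical names. The resampling step that normally follows a rejection is omitted.

```python
import random

def count_accepted(draft_probs, target_probs, rng=random.random):
    """Count how many drafted tokens survive verification.

    draft_probs[i]  : probability the drafter assigned to its i-th drafted token
    target_probs[i] : probability the verifier assigns to that same token
    Assumption: standard speculative-sampling rule, accept with
    probability min(1, p / q) and stop at the first rejection.
    """
    accepted = 0
    for q, p in zip(draft_probs, target_probs):
        if rng() < min(1.0, p / q):
            accepted += 1
        else:
            break  # every later drafted token is discarded
    return accepted

# Deterministic illustration: pin the random draw to 0.5.
n_ok = count_accepted([0.5, 0.5, 0.5], [0.6, 0.4, 0.1], rng=lambda: 0.5)
print(n_ok)
```

Under this rule, a drafter whose probabilities track the verifier's more closely gets longer accepted runs, which is why a more expressive draft distribution can pay for its extra latency.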
Abstract
Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice expressiveness by assuming independence between future tokens. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens by selecting different circuit architectures, generalising classical models such as (hierarchical) mixture models, hidden Markov models and tensor networks. We show the efficacy of MTPC by retrofitting existing byte-level LLMs, such as EvaByte. Our experiments show that, when combined with speculative decoding, MTPC significantly…
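The expressiveness gap mentioned in the abstract can be illustrated with a toy example. The sketch below compares a fully factorised joint over two future tokens with a two-component mixture of factorised distributions. The vocabulary, weights, and function names are illustrative assumptions, not the paper's parameterisation.

```python
from itertools import product

def fully_factorised(marginals):
    """Joint over future tokens as a product of independent per-position marginals."""
    def joint(tokens):
        p = 1.0
        for pos, tok in enumerate(tokens):
            p *= marginals[pos][tok]
        return p
    return joint

def mixture(weights, components):
    """Joint as a weighted sum of fully factorised components."""
    parts = [fully_factorised(c) for c in components]
    def joint(tokens):
        return sum(w * part(tokens) for w, part in zip(weights, parts))
    return joint

# Toy vocabulary {0, 1}, two future positions.
# Target behaviour: the two tokens are always equal.
ff = fully_factorised([[0.5, 0.5], [0.5, 0.5]])
mix = mixture(
    [0.5, 0.5],
    [
        [[1.0, 0.0], [1.0, 0.0]],  # component 0 emits (0, 0)
        [[0.0, 1.0], [0.0, 1.0]],  # component 1 emits (1, 1)
    ],
)

ff_table = {t: ff(t) for t in product(range(2), repeat=2)}
mix_table = {t: mix(t) for t in product(range(2), repeat=2)}
print(ff_table)
print(mix_table)
```

Both models have the same per-position marginals, but only the mixture places zero mass on the mismatched pairs (0, 1) and (1, 0). The factorised model spreads probability evenly over all four pairs because it cannot represent the dependency.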
Peer Reviews
Decision: Submitted to ICLR 2026
- I found the paper to overall be well-written, aside from a few nitpicks that I've highlighted in my questions below.
- The paper offers a general, principled framework that encompasses several of the previous works.
- By exploiting connections to previous work, the authors manage to increase the expressiveness of the drafters while minimizing the latency for an overall improved throughput of 1.22x.
- The paper deals with byte-level LLMs, which in my opinion greatly limits its scope, as it's hard to draw strong conclusions about its performance on sub-word LLMs, which are far more commonly used by the community.
- The paper details the requirements for training the MTPC, which by the authors' description is a very arduous process and could therefore limit adoption of the proposed approach.
1. MTPC provides a unified probabilistic circuit framework that systematically navigates the MTP design space, introducing novel HMM and BTree architectures, with BTree achieving optimal throughput by parallelizing latent sampling while maintaining high acceptance rates.
2. The paper rigorously examines trade-offs across PC architecture selection (FF/CP/HMM/BTree) and partial layer sharing via LoRA (0-4 layers), revealing device-specific optimal configurations through systematic ablations across mixt…
1. All experiments focus exclusively on EvaByte (6.5B byte-level model with v=320), without validation on subword-level LLMs where vocabularies are 300× larger or across different model families/sizes, limiting claims about scalability.
2. Key design decisions including inhomogeneous HMMs, identity matrix initialization, and why BTree outperforms CP lack theoretical justification beyond empirical validation, with no analysis of when specific architectures excel for different prompt characteristi…
1. The paper introduces MTPC, a multi-token prediction framework built on probabilistic circuits, which overcomes the independence assumptions of prior MTP methods. This allows MTPC to model joint token dependencies more effectively than factorized or tensor-decomposition-based approaches.
2. The paper rigorously studies the trade-offs between acceptance rate and generation latency across different PC architectures and different levels of layer sharing. This provides a clear and interpretable d…
1. Experiments focus on a single 6.5B byte-level model (EvaByte) and one SFT mixture (Tülu-3). Showing transfer to a subword LLM (to decouple gains from byte vocabularies) and to additional datasets/domains would strengthen the claims.
2. While the loss and discounting are described, ablations on optimization sensitivity (γ, window overlap, head depth/width) are limited. More ablation studies would strengthen the paper.
3. The paper emphasizes latency but gives fewer numbers on…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
