TL;DR
This paper introduces ASPD, a novel decoding method that exploits intrinsic parallelism in LLM outputs to significantly accelerate inference speed while maintaining high response quality.
Contribution
We propose an adaptive decoding framework that automatically identifies parallelizable structures in autoregressive outputs and enables seamless serial-parallel decoding transitions.
Findings
Achieves up to 3.19x speedup on Vicuna Bench
Maintains response quality within 1% of autoregressive models
Demonstrates effectiveness across diverse tasks
Abstract
The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm, which is characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: the automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models.…
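To make the idea of intrinsic parallelism concrete, the following is a minimal sketch of how decoding independent branches at the same time shortens the number of decoding steps. It is not the ASPD implementation: the `Response` class, its fields, and the helper functions are hypothetical names introduced for illustration, and the model assumes an idealized setting in which one decoding step emits one token per active branch and parallel branches add no system overhead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Response:
    """Hypothetical token-count view of one response (illustration only)."""
    prefix_len: int                                        # tokens decoded serially before the branches
    branch_lens: List[int] = field(default_factory=list)   # tokens in each independent branch
    suffix_len: int = 0                                    # tokens decoded serially after the branches


def serial_steps(r: Response) -> int:
    # Plain autoregressive decoding: one step per token, branches one after another.
    return r.prefix_len + sum(r.branch_lens) + r.suffix_len


def serial_parallel_steps(r: Response) -> int:
    # Idealized serial-parallel decoding: all branches advance together,
    # so the parallel segment costs as many steps as its longest branch.
    longest = max(r.branch_lens, default=0)
    return r.prefix_len + longest + r.suffix_len


def ideal_speedup(r: Response) -> float:
    # Step-count ratio only; it ignores batching cost, memory traffic,
    # and any other real system overhead (an assumption of this sketch).
    steps = serial_parallel_steps(r)
    if steps == 0:
        raise ValueError("response has no tokens")
    return serial_steps(r) / steps


if __name__ == "__main__":
    # A response with a 10-token intro, three list items, and a 20-token conclusion.
    example = Response(prefix_len=10, branch_lens=[30, 40, 50], suffix_len=20)
    print(serial_steps(example), serial_parallel_steps(example), ideal_speedup(example))
```

Under these assumptions the achievable gain is bounded by the longest branch plus the serial parts, so a response with few or very uneven branches, or with long serial segments, benefits little from parallel decoding.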
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper tackles an interesting aspect of LLM parallelism, and the intrinsic parallelism it identifies, such as lists, is interesting.
- The experiments are comprehensive and thorough, covering diverse tasks such as STEM, roleplay, reasoning, and extraction.

- Speedups for certain tasks such as mathematical reasoning are limited. For example, the speedup on MATH500 is 1.17x, much lower than the 1.82x achieved on Vicuna Bench.
- The method is dependent on task structure. Mathematical reasoning, for instance, involves "strong inter-step dependencies" and "step-by-step deductions," which naturally reduce the opportunities for parallelization.
- The training overhead seems to be missing. What is the training overhead and how long does it take? Consid

- ASPD enables parallel decoding while addressing the weaknesses of previous work (no sequential decoding after parallelizing in APAR; approximated position IDs disrupting position continuity in Pasta).
- The paper is generally well written and easy to understand, with the figures giving a very clear overview of the methodology and of the differences from previous works.
- The experiments show that ASPD achieves the greatest tokens/sec and highest quality compared to APAR, SOT, and sequential across

- The paper does not present the wall-clock latency speedup of the different methods, only tokens/sec and other efficiency metrics that do not account for the actual system overheads of the methodology. For a speed-oriented parallelization method, wall-clock speedup is an important evaluation metric.
- It seems that the main difference between ASPD and Pasta is that in ASPD the position ID is maintained as if the tokens generated in parallel were actually sequential (i.e. ground truth position
Parallel decoding is a promising technique for inference acceleration. While previous works mainly focus on token-level parallel decoding (i.e., decoding multiple tokens simultaneously), this paper leverages the intrinsic parallelism in LLMs. This is a good motivation. The method shows speed gains across diverse domains and models, with minimal trade-off in output quality.
see questions
