ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

Keyu Chen; Zhifeng Shen; Daohai Yu; Haoqian Wu; Wei Wen; Jianfeng He; Ruizhi Qiao; Xing Sun

arXiv:2508.08895·cs.CL·August 15, 2025

3 Reviews

TL;DR

This paper introduces ASPD, a decoding method that exploits intrinsic parallelism in LLM outputs to accelerate inference while keeping response quality close to that of autoregressive decoding.

Contribution

We propose an adaptive decoding framework that automatically identifies parallelizable structures in autoregressive outputs and enables seamless serial-parallel decoding transitions.

Findings

01 · Achieves up to 3.19x speedup on Vicuna Bench

02 · Maintains response quality within 1% of autoregressive models

03 · Demonstrates effectiveness across diverse tasks

Abstract

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily because their autoregressive decoding paradigm predicts the next token sequentially. By re-examining the outputs of autoregressive models, we observe that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: the automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models.…
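To make the idea of intrinsic parallelism concrete, the following is a minimal sketch and not the paper's pipeline. It assumes, purely for illustration, that numbered-list items stand in for parallelizable branches, that whitespace-separated words approximate tokens, and that every branch advances one token per decoding step with no overhead. The function names `split_branches`, `count_tokens`, and `ideal_speedup` are hypothetical.

```python
import re
from typing import List, Tuple


def split_branches(response: str) -> Tuple[str, List[str]]:
    """Split a response into a serial prefix and candidate parallel branches.

    Assumption: each numbered-list item ("1. ...", "2) ...") is one branch.
    """
    prefix: List[str] = []
    branches: List[str] = []
    for line in response.strip().splitlines():
        if re.match(r"^\s*\d+[.)]\s+", line):
            branches.append(line.strip())          # a new list item starts a branch
        elif branches:
            branches[-1] += " " + line.strip()     # continuation of the current item
        else:
            prefix.append(line.strip())            # text before the first item is serial
    return " ".join(prefix), branches


def count_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words (an assumption)."""
    return len(text.split())


def ideal_speedup(serial_tokens: int, branch_tokens: List[int]) -> float:
    """Ratio of sequential decoding steps to idealized serial-parallel steps."""
    sequential_steps = serial_tokens + sum(branch_tokens)
    # Branches decode simultaneously, so the longest branch sets the step count.
    parallel_steps = serial_tokens + (max(branch_tokens) if branch_tokens else 0)
    if parallel_steps == 0:
        return 1.0
    return sequential_steps / parallel_steps


response = (
    "Here are three tips:\n"
    "1. Sleep eight hours each night.\n"
    "2. Drink water.\n"
    "3. Exercise daily for thirty minutes."
)
prefix, branches = split_branches(response)
speedup = ideal_speedup(count_tokens(prefix), [count_tokens(b) for b in branches])
print(round(speedup, 2))  # 1.9
```

Under these assumptions the bound depends only on how much of the response sits in branches and how balanced the branches are, so a response with no list-like structure gets a bound of 1.0.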

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- The paper tackles an interesting aspect of LLM parallelism, and the intrinsic parallelism it identifies (for example, in lists) is interesting.
- The experiments are comprehensive and thorough, covering tasks such as STEM, roleplay, reasoning, and extraction.

Weaknesses

- Speedups for certain tasks such as mathematical reasoning are limited. For example, the speedup on MATH500 is 1.17x, much lower than the 1.82x achieved on Vicuna Bench.
- The method is dependent on task structure. Mathematical reasoning, for instance, involves "strong inter-step dependencies" and "step-by-step deductions," which naturally reduce the opportunities for parallelization.
- The training overhead seems to be missing. What is the training overhead, and how long does training take? Consid

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- ASPD enables parallel decoding while addressing the weaknesses of previous work (no sequential decoding after parallelizing in APAR; approximated position IDs disrupting position continuity in Pasta).
- The paper is generally well written and easy to understand, with the figures giving a very clear overview of the methodology and of the differences from previous works.
- The experiments show that ASPD achieves the greatest tokens/sec and highest quality compared to APAR, SOT, and sequential across

Weaknesses

- The paper does not present the wall-clock latency speedup of the different methods, only tokens/sec and other efficiency metrics that do not account for the actual system overheads of the methodology. For a speed-oriented parallelization method, wall-clock speedup is an important evaluation metric.
- It seems that the main difference between ASPD and Pasta is that in ASPD the position ID is maintained as if the tokens generated in parallel were actually sequential (i.e. ground truth position

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Parallel decoding is a promising technique for inference acceleration. While previous works mainly focus on token-level parallel decoding (i.e., decoding multiple tokens simultaneously), this paper leverages the intrinsic parallelism in LLM outputs, which is a good motivation. The method shows speed gains across diverse domains and models, with minimal trade‑off in output quality.

Weaknesses

see questions
