Reasoning Language Model Inference Serving Unveiled: An Empirical Study
Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu

TL;DR
This paper provides a comprehensive empirical analysis of reasoning large language model (RLLM) inference serving, revealing unique performance behaviors, evaluating optimization techniques, and validating findings with real-world workload tests.
Contribution
It is the first detailed empirical study on RLLM serving performance, behavior, and optimization effectiveness in real-world scenarios.
Findings
RLLM serving exhibits high memory usage, fluctuations, and adaptive response times.
Model quantization and speculative decoding improve efficiency with minimal accuracy loss.
Prefix caching and KV cache quantization may reduce accuracy or performance for small RLLMs.
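One plausible reading of the memory finding is that long reasoning traces inflate the per-request KV cache, which would also explain the interest in KV cache quantization. The following is a minimal back-of-the-envelope sketch, not the paper's methodology: it uses the standard estimate of two tensors (keys and values) per layer, and the model configuration and sequence lengths are hypothetical values chosen for illustration only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values
    in every layer. bytes_per_elem=2 corresponds to FP16/BF16 storage,
    bytes_per_elem=1 to an 8-bit quantized cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model configuration (an assumption, not taken from the paper).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical sequence lengths: a short answer versus a long reasoning trace.
short_fp16 = kv_cache_bytes(seq_len=512, **cfg)
long_fp16 = kv_cache_bytes(seq_len=8192, **cfg)
long_8bit = kv_cache_bytes(seq_len=8192, bytes_per_elem=1, **cfg)

MIB = 1024 ** 2
print(f"short answer, 16-bit cache: {short_fp16 / MIB:.1f} MiB")
print(f"long trace,   16-bit cache: {long_fp16 / MIB:.1f} MiB")
print(f"long trace,    8-bit cache: {long_8bit / MIB:.1f} MiB")
```

Under these assumed numbers, the cache grows linearly with generated length, so a trace 16 times longer holds 16 times the memory per request, and an 8-bit cache halves that footprint. Whether the halving is worth the accuracy risk for small models is the empirical question the findings above address.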
Abstract
Reasoning large language models (RLLMs) have proven competitive with general LLMs on complex reasoning tasks such as mathematics and coding. However, the serving performance and behavior of RLLMs remain unexplored, which may undermine their deployment and utilization in real-world scenarios. To close this gap, we conduct a comprehensive study of RLLM service. We first perform a pilot study comparing the serving performance of RLLMs and traditional LLMs, and reveal several distinct differences in serving behavior: (1) significant memory usage and fluctuations; (2) straggler requests; (3) adaptive running time; (4) domain preference. We then investigate whether existing inference optimization techniques are valid for RLLMs. Our main takeaways are that model quantization methods and speculative decoding can improve…
Peer Reviews
Decision: ICLR 2026 Poster
``S1``: As long-CoT reasoning models become mainstream with the emergence of many prevalent deep reasoning models, their serving behavior is an important systems problem. ``S2``: Good breadth of experiments across engines (vLLM and SGLang), model scales, and several optimizations (weights, KV, prefix cache, speculation). ``S3``: The summary of observations and findings is clear.
``W1``: The study focuses on GSM8K, MATH-500, AIME-2024, GPQA. Given rapid shifts in eval suites and the benchmark nature of this paper, I’d suggest the authors add SuperGPQA and CommonsenseQA. ``W2``: I would have expected to see some elaborated explanations of some observations in the paper. For example, the authors claim that: - “We find that for sufficiently large RLLMs (14B and above), prefix caching significantly improves runtime speed and serving metrics without compromising performance
Novel and relevant empirical perspective. This is one of the first systematic studies on the serving characteristics of reasoning-oriented LLMs, which are becoming increasingly important in practice. The empirical exploration fills a gap between model-level reasoning research and system-level inference efficiency. Comprehensive experimental coverage. The paper evaluates multiple model scales (7B–70B), reasoning datasets (GSM8K, MATH500, AIME24, GPQA), and optimization techniques (quantization,
Lack of explanation for partial observations. Several empirical findings are not sufficiently explained. For instance, the paper reports that prefix caching provides little or even negative benefit for 7B reasoning models, but no analysis is given on why this happens. Without such interpretation, the results remain descriptive rather than insightful. Missing comparison with standard LLMs in Section 5. Section 5 focuses on evaluating the effectiveness of several optimization techniques—such as q
1. The proposed benchmark and framework for evaluating RLLM serving are valuable contributions, especially as RLLMs become increasingly prevalent. 2. The empirical study is extensive, offering interesting observations on both serving behaviors and serving optimization techniques for RLLMs. These findings should be useful for researchers and practitioners aiming to improve systems for serving LLMs. 3. The serving performance of reasoning models remains under-explored, and this paper helps fill tha
This paper primarily presents empirical observations without offering much in-depth analysis. The authors treat the models largely as black boxes, running benchmark experiments and reporting results without providing deeper insights or interpretations. Some of the evaluations, while interesting, have been explored in prior work, such as comparisons between RLLMs and LLMs that do not focus on the serving aspect. It would strengthen the paper to narrow the scope and focus more clearly on serving-
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Natural Language Processing Techniques
