Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan

TL;DR
Mini-Omni-Reasoner introduces a token-level interleaving approach for reasoning and speaking in large speech models, enabling real-time, logically grounded speech generation without waiting for reasoning to finish before speaking.
Contribution
It proposes a 'Thinking-in-Speaking' framework that interleaves reasoning and response tokens at the token level, supported by a new dataset (Spoken-Math-Problems-3M), to improve the reasoning capabilities of large speech models.
Findings
Achieves +19.1% in arithmetic reasoning on Spoken-MQA
Attains +6.4% in contextual understanding
Reduces decoding latency and shortens outputs
Abstract
Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows…
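The token-level interleaving described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the block sizes (2 response tokens followed by 8 reasoning tokens) are taken from the peer reviews quoted below, while the function names `interleave` and `spoken_only`, the `"speak"`/`"think"` labels, and the handling of a stream that runs out early are assumptions made for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Tagged = Tuple[str, str]  # (kind, token), kind is "speak" or "think"


def interleave(
    response: Iterable[str],
    reasoning: Iterable[str],
    n_resp: int = 2,   # response tokens per step (figure reported by reviewers)
    n_reas: int = 8,   # reasoning tokens per step (figure reported by reviewers)
) -> Iterator[Tagged]:
    """Yield one tagged stream: n_resp spoken tokens, then n_reas silent ones.

    Assumption: when one stream is exhausted, the other simply continues.
    The excerpt does not say how the paper handles this case.
    """
    resp_it, reas_it = iter(response), iter(reasoning)
    while True:
        resp_chunk = list(islice(resp_it, n_resp))
        reas_chunk = list(islice(reas_it, n_reas))
        if not resp_chunk and not reas_chunk:
            return
        for tok in resp_chunk:
            yield ("speak", tok)
        for tok in reas_chunk:
            yield ("think", tok)


def spoken_only(stream: Iterable[Tagged]) -> List[str]:
    """Keep only the tokens that would be vocalized; reasoning stays silent."""
    return [tok for kind, tok in stream if kind == "speak"]


if __name__ == "__main__":
    response = ["r0", "r1", "r2"]
    reasoning = [f"t{i}" for i in range(10)]
    for kind, tok in interleave(response, reasoning):
        print(kind, tok)
```

Tagging each token, rather than keeping two separate lists, makes it easy to route only the `"speak"` tokens to a speech decoder while the `"think"` tokens remain internal.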
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Large-scale dataset contribution. The paper introduces a 3M-sample spoken math-reasoning dataset with paired reasoning traces. This is a valuable resource and, if released, could meaningfully benefit the community and future research in speech-based reasoning. 2. Clear writing and presentation. The problem, motivation, and methodology are presented clearly, supported by informative figures and conceptual diagrams that make the technical ideas easy to follow. 3. Comprehensive ablation and an
1. Limited task scope and unclear generalization. The method is evaluated only on spoken mathematical reasoning and a single benchmark (Spoken-MQA). While math is a structured domain, the ability to generalize to broader real-world multimodal tasks remains unclear. The current claims are not well supported, given that effectiveness beyond math has not yet been demonstrated. 2. Fixed token-ratio design lacks justification and appears overly hand-crafted. The 2:8 speech-to-reasoning token ratio is motiva
- Each step of the training procedure is explained clearly and in sufficient detail. - The idea is interesting and practical. It could make interaction with the model more convenient by reducing user waiting time after speaking. - Experimental results include multiple baseline models, making the comparison relatively comprehensive.
- Although the main idea is "thinking while speaking", the interleaving order between reasoning and response tokens is reversed -- the model first generates response tokens, then reasoning tokens. In this setup, the reasoning tokens serve as post-hoc justification rather than genuine reasoning, which contradicts the motivation of the paper. - The numbers of response and reasoning tokens per interleaving step are fixed (2 and 8, respectively). However, different parts of intermediate steps may re
- The motivation for the paper is clear: the authors would like to address the latency issue in speech reasoning. - The structure of the paper is clear and easy to follow. - The authors design a complete system, covering dataset construction and a well-designed training pipeline. - Experimental results strongly demonstrate the effectiveness of the model, with a 19.1% arithmetic improvement, a +6.4% reasoning accuracy improvement, and 63% shorter responses than baselines.
- Limited evaluation scope: Only mathematical reasoning tasks are evaluated. No evidence shows this can generalize to other domains, e.g. dialogue or creative tasks. - Weak ratio justification: The 2:8 ratio is derived from GPU throughput (100 vs 12.5 tokens/sec), not from cognitive or linguistic principles. No perceptual studies justify the 20 spoken tokens/sec target. There are no ablations testing 1:9, 3:7, 4:6, or dynamic ratios.
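The arithmetic behind the ratio discussion can be checked with a short sketch. It assumes the 100 tokens/sec figure is the total decoding throughput, so a 2:8 split leaves 20 spoken tokens/sec, which matches the target the reviewer mentions. The role of the 12.5 tokens/sec figure is not clear from this excerpt and is not modeled. The helper name `spoken_rate` is hypothetical.

```python
def spoken_rate(total_tokens_per_sec: float, n_resp: int, n_reas: int) -> float:
    """Spoken tokens per second when n_resp of every (n_resp + n_reas)
    decoded tokens are response tokens.

    Assumption: total_tokens_per_sec is the overall decoding throughput.
    """
    return total_tokens_per_sec * n_resp / (n_resp + n_reas)


if __name__ == "__main__":
    # Ratios the reviewer lists as untested, plus the paper's 2:8 setting.
    for n_resp, n_reas in [(1, 9), (2, 8), (3, 7), (4, 6)]:
        rate = spoken_rate(100, n_resp, n_reas)
        print(f"{n_resp}:{n_reas} -> {rate:.1f} spoken tokens/sec")
```

Under this assumption, each step toward more response tokens per block raises the spoken rate by 10 tokens/sec while shrinking the reasoning budget by the same amount, which is the trade-off the requested ablations would measure.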
1. The “thinking-in-speaking” formulation is clearly motivated by modality constraints and addresses the latency issue in the “thinking-before-speaking” paradigm. This is an interesting conceptual shift with potential implications beyond math tasks. 2. The paper combines algorithmic design (token-level interleaving), dataset construction (Spoken-Math-Problems-3M), and a new model (Mini-Omni-Reasoner). Together they demonstrate an end-to-end system for real-time reasoning in speech. 3. The paper
1. Although the 2:8 speech-to-reasoning token ratio is justified by hardware throughput, hardcoding this ratio limits adaptability. If deployment hardware or inference speed changes, both model and dataset construction would need to be redone. It would strengthen the work to include an ablation or adaptive mechanism for this ratio. In related LLM work such as Quiet-STaR [1], increasing the number of “thinking” tokens can systematically improve reasoning. I am wondering if this is the case for Mini-Om
Taxonomy
Topics: Speech and dialogue systems
