ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

TL;DR
ThinkOmni is a training-free framework that enhances omni-modal reasoning in large language models by guiding decoding with off-the-shelf reasoning models and adaptively balancing perception and reasoning signals, improving performance across multiple benchmarks.
Contribution
It introduces a novel, training-free approach that leverages existing reasoning models to boost omni-modal reasoning in large language models without additional data or training.
Findings
Achieves 70.2 on the MathVista benchmark.
Achieves 75.5 on the MMAU benchmark.
Demonstrates consistent performance improvements across six multi-modal reasoning benchmarks.
Abstract
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLMs) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRMs). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual…
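The abstract describes guiding the OLLM's decoding with an LRM, and the reviews note that this is done by logit fusion over a shared vocabulary. The exact fusion rule is not given in this summary, so the following is only a minimal sketch under an assumed rule: a linear interpolation of the two models' log-probabilities at one decoding step. The function names and the `alpha` parameter are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def fuse_logits(ollm_logits, lrm_logits, alpha):
    """Assumed fusion rule (not the paper's): interpolate log-probabilities.

    alpha = 0 keeps only the OLLM (perception) signal;
    alpha = 1 keeps only the LRM (reasoning) signal.
    """
    # Token-level fusion only makes sense if index i means the same token
    # in both models, i.e. the two models share one vocabulary.
    if len(ollm_logits) != len(lrm_logits):
        raise ValueError("OLLM and LRM must share the same vocabulary")
    lp_ollm = log_softmax(ollm_logits)
    lp_lrm = log_softmax(lrm_logits)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(lp_ollm, lp_lrm)]


def greedy_next_token(ollm_logits, lrm_logits, alpha):
    """Pick the index of the highest fused score for one decoding step."""
    fused = fuse_logits(ollm_logits, lrm_logits, alpha)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy 3-token vocabulary: the OLLM prefers token 0, the LRM prefers token 2.
ollm_step = [2.0, 0.5, 0.1]
lrm_step = [0.1, 0.4, 3.0]
choice_perception_only = greedy_next_token(ollm_step, lrm_step, alpha=0.0)
choice_reasoning_only = greedy_next_token(ollm_step, lrm_step, alpha=1.0)
```

In a real decoder both logit vectors would come from separate forward passes at every step, which is the source of the inference overhead raised in the reviews.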
Peer Reviews
Decision · ICLR 2026 Poster
- The idea of using a text-only reasoning model to guide an omni-modal model during inference is a clever approach to addressing the scarcity of high-quality omni-modal reasoning data. While guidance decoding exists, applying it specifically to lift reasoning capabilities from one modality to many is a novel application.
- The proposed Stepwise Contrastive Scaling is a technically sound contribution. By using Jensen-Shannon divergence to measure the "disagreement" or unique contribution of reas
- (minor, acknowledged by the authors) The framework relies on logit fusion, which strictly requires the OLLM and the LRM to share the exact same tokenizer vocabulary. This is a major limitation to the claim of using "off-the-shelf" LRMs, as it restricts acceptable pairings to models within the same family (e.g., Qwen-based OLLMs with Qwen-based LRMs).
- Since the LRM never sees the omni-modal input and only sees the text trace, there is a theoretical risk that in highly visual/auditory tasks where t
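The first review mentions Jensen-Shannon divergence as the measure of disagreement between models in Stepwise Contrastive Scaling. How ThinkOmni turns that divergence into a guidance weight is not stated in this excerpt. The sketch below therefore only computes the divergence between two next-token distributions and, as a clearly hypothetical choice, normalizes it by ln 2 so that it lies in [0, 1]. `guidance_weight` is an illustrative name, not the paper's.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats: symmetric, between 0 and ln 2."""
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def guidance_weight(ollm_logits, lrm_logits):
    """Hypothetical mapping (an assumption): JSD normalized to [0, 1]."""
    p = softmax(ollm_logits)
    q = softmax(lrm_logits)
    return js_divergence(p, q) / math.log(2.0)


# Identical predictions give weight 0; strong disagreement pushes it toward 1.
w_agree = guidance_weight([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
w_disagree = guidance_weight([10.0, 0.0, 0.0], [0.0, 0.0, 10.0])
```

Because the divergence is recomputed from the current step's distributions, a weight derived this way changes per token without a hand-tuned constant, which matches the adaptive balancing the reviews describe.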
- The proposed Stepwise Contrastive Scaling mechanism is a meaningful contribution, enabling adaptive tuning of guidance weights without manual hyperparameter search.
- Extensive experiments across six benchmarks demonstrate the generality and scalability of the method.
- The framework is model-agnostic and can be applied to various OLLM and LRM combinations, enhancing its potential impact.
- The idea of leveraging pre-trained LRMs to enhance OLLM reasoning without additional training is novel a
- The approach requires an external LLM to guide the response, which raises doubts about where the performance gain comes from. Does the LLM introduce extra information for solving the question?
- Although training-free, the approach incurs non-trivial inference overhead due to multiple forward passes per decoding step, which may hinder real-time deployment.
1. Innovative Framework: The paper proposes ThinkOmni, a training-free, decoding-time method that cleverly integrates a Large Reasoning Model (LRM) into an omni-modal LLM (OLLM) pipeline. The LRM-as-a-Guide and Stepwise Contrastive Scaling mechanisms are novel and well-motivated.
2. Strong Empirical Results: The method shows consistent and significant improvements across six diverse omni-modal reasoning benchmarks. The gains are particularly notable given that no additional training is required.
1. Limited Focus on Math Domain: The benchmark is currently restricted to the math domain at the image-level. It would be beneficial to expand the experiments to include perception-level tasks, such as MME, as well as reasoning benchmarks like MMLU, to provide a more comprehensive evaluation.
2. Confusion Regarding Figure 4: There is some confusion regarding the notation in Figure 4, specifically the terms \( x_{<t} \) and \( x_{<t+1} \). These appear to suggest a multi-turn conversational set
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
