Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study

Yujie Lin; Ante Wang; Moye Chen; Jingyao Liu; Hao Liu; Jinsong Su; Xinyan Xiao

arXiv:2502.11514·cs.CL·May 27, 2025

Open Access

TL;DR

This paper examines how inference-time scaling affects multi-modal reasoning when the reasoning chain itself combines visual and textual steps. It evaluates sampling-based and tree search-based methods on 10 tasks across various domains to identify the benefits and challenges.

Contribution

The authors present what they describe as a pioneering exploration of inference-time scaling with multi-modal thought, analyzing its impact and challenges across various reasoning tasks.

Findings

01 Multi-modal thought improves reasoning performance over text-only approaches.

02 Blending visual and textual reasoning fosters more diverse thinking.

03 Multi-modal thoughts require higher token consumption, raising practical concerns.

Abstract

Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. In addition, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both methods across different thought paradigms. Results show that multi-modal thought achieves better performance than conventional text-only thought, and…
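
The abstract mentions sampling-based scaling guided by a verifier. As a rough illustration only, the sketch below shows one common form of that idea: verifier-weighted voting over several sampled reasoning chains. It is not the paper's consistency-enhanced verifier, whose details are not given here. The names `Candidate`, `select_answer`, and `toy_verifier` are hypothetical, and the scoring rule is a stand-in chosen only so the example runs.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A candidate is (reasoning_chain, final_answer). Both are hypothetical
# placeholders; a real multi-modal chain could interleave text and image steps.
Candidate = Tuple[str, str]


def select_answer(candidates: List[Candidate],
                  verifier: Callable[[str], float]) -> str:
    """Pick the answer whose sampled chains have the highest total verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    totals: Dict[str, float] = defaultdict(float)
    for chain, answer in candidates:
        # Chains that agree on an answer pool their verifier scores.
        totals[answer] += verifier(chain)
    return max(totals, key=totals.get)


def toy_verifier(chain: str) -> float:
    # Stand-in scorer (assumption): 0.2 per word in the chain, capped at 1.0.
    return min(1.0, 0.2 * len(chain.split()))


samples = [
    ("look at the grid then count", "4"),
    ("count", "5"),
    ("count rows then columns", "4"),
]
print(select_answer(samples, toy_verifier))
```

Sampling more chains raises the number of entries in `samples`, which is where the extra inference-time compute goes. A tree search variant would instead score partial chains and expand only the promising ones.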

Taxonomy

Topics: Data Visualization and Analytics · Advanced Text Analysis Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate