Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi

TL;DR
Meta-Reasoner introduces a dynamic, strategy-adapting framework using contextual multi-armed bandits to improve inference efficiency and accuracy in large language models during multi-step reasoning tasks.
Contribution
It presents a novel meta-guidance framework that enables LLMs to adapt reasoning strategies in real time, enhancing performance and efficiency.
Findings
Outperforms previous SOTA methods by 9-12% in accuracy.
Reduces inference time by 28-35% under the same compute budget.
Demonstrates generalizability across diverse reasoning tasks.
Abstract
Large Language Models (LLMs) often struggle with computational efficiency and error propagation in multi-step reasoning tasks. While recent advances in prompting and post-training have enabled LLMs to perform step-wise reasoning, they still tend to explore unproductive solution paths without effective backtracking or strategy adjustment. In this paper, we propose Meta-Reasoner, a new framework that empowers LLMs to "think about how to think". It optimizes the inference process by dynamically adapting reasoning strategies in real time. Our approach employs contextual multi-armed bandits (CMABs) to learn an adaptive policy. The policy learns to evaluate the current state of the LLM's reasoning and to determine the strategy most likely to lead to a successful outcome during inference, such as whether to backtrack, switch to a new approach, or restart the problem-solving process. This…
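To make the bandit idea in the abstract concrete, here is a minimal sketch of a contextual multi-armed bandit choosing among reasoning strategies. It is not the authors' implementation: the abstract does not say which bandit algorithm is used, so LinUCB is assumed here as one common choice. The "continue" arm, the two-dimensional context vector, and the toy reward rule are all hypothetical, chosen only so the example runs on its own.

```python
import numpy as np

# Strategy names: the last three follow the abstract; "continue" is an assumption.
STRATEGIES = ["continue", "backtrack", "switch_approach", "restart"]


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (assumed algorithm)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration weight
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vector

    def select(self, x):
        """Return the index of the arm with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # estimated reward weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold the observed reward for (context, arm) into that arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def train_toy(rounds=50):
    """Train on a hypothetical environment with two reasoning states.

    Context [1, 0] stands for "stuck on a dead end" (best arm: backtrack);
    context [0, 1] stands for "approach is fundamentally wrong" (best arm: restart).
    Both contexts and the 0/1 reward rule are illustrative assumptions.
    """
    contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    best_arm = [STRATEGIES.index("backtrack"), STRATEGIES.index("restart")]
    bandit = LinUCB(n_arms=len(STRATEGIES), dim=2)
    for t in range(rounds):
        i = t % 2                          # alternate between the two states
        arm = bandit.select(contexts[i])
        reward = 1.0 if arm == best_arm[i] else 0.0
        bandit.update(arm, contexts[i], reward)
    return bandit
```

Calling `train_toy()` and then `bandit.select(...)` on either context returns the index of the strategy the bandit has learned to prefer for that state. In a real system the context would be a summary of the LLM's reasoning progress and the reward a signal of solution quality; how either is computed is not stated in the portion of the abstract shown here.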
