Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin, Sen Hu, Zhenheng Tang, Huacan Wang, Ronghao Chen

TL;DR
Octopus introduces a new multimodal reasoning framework that enables autonomous exploration and dynamic capability selection, significantly improving performance across diverse tasks by mimicking human-like reasoning abilities.
Contribution
The paper defines six core reasoning capabilities, organizes a comprehensive benchmark (Octopus-Bench) around them, and demonstrates a system that autonomously orchestrates these capabilities for improved multimodal reasoning.
Findings
Achieves state-of-the-art performance on Octopus-Bench tasks
Demonstrates effective autonomous capability exploration and selection
Highlights importance of capability coordination in multimodal reasoning
Abstract
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus…
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning
