Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

Yifu Guo; Zishan Xu; Zhiyuan Yao; Yuquan Lu; Jiaye Lin; Sen Hu; Zhenheng Tang; Huacan Wang; Ronghao Chen

arXiv:2511.15351·cs.AI·December 15, 2025

TL;DR

Octopus is a multimodal agentic reasoning framework that autonomously explores reasoning pathways and dynamically selects among six core capabilities, modeled on the complementary thinking abilities humans bring to such tasks.

Contribution

The paper defines six core capabilities for multimodal reasoning, organizes an evaluation benchmark (Octopus-Bench) around them, and demonstrates a system that autonomously orchestrates these capabilities.

Findings

01. Achieves state-of-the-art performance on Octopus-Bench tasks
02. Demonstrates effective autonomous capability exploration and selection
03. Highlights importance of capability coordination in multimodal reasoning

Abstract

Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus…

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning