SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng; Tao Hu; Wenwen Tong; Xueheng Li; Jiandong Chen; Haojia Yu; Jiefan Lu; Hewei Guo; Hanming Deng; Chengjun Xie; Gao Huang; Dahua Lin; Lewei Lu

arXiv:2512.24330 · cs.CV · January 27, 2026

TL;DR

SenseNova-MARS introduces a reinforcement learning framework that enhances multimodal vision-language models with dynamic visual reasoning and tool use, enabling more human-like complex task solving in knowledge-intensive scenarios.

Contribution

The paper presents SenseNova-MARS, a novel RL-based framework that integrates visual reasoning and external tool invocation for multimodal models, along with a new benchmark for evaluation.

Findings

01. Achieves state-of-the-art results on search and image understanding benchmarks.

02. Effectively integrates visual reasoning with tool use via RL.

03. Outperforms proprietary models on key benchmarks.

Abstract

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose…
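The interleaving of reasoning and tool calls described in the abstract can be illustrated with a toy agent loop. This is a minimal sketch under stated assumptions, not the SenseNova-MARS implementation: the tool functions are stubs that return placeholder strings, `scripted_policy` is a hypothetical stand-in for the VLM's decision at each turn, and names such as `Step`, `TOOLS`, and `run_agent` are invented for illustration. The RL training stage is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (role, content) pairs


@dataclass
class Step:
    kind: str       # "tool" or "answer"
    name: str = ""  # tool name when kind == "tool"
    arg: str = ""   # tool argument, or the final answer text


# Hypothetical stub tools; a real system would call search or image backends.
def image_search(query: str) -> str:
    return f"[image_search results for {query!r}]"


def text_search(query: str) -> str:
    return f"[text_search results for {query!r}]"


def image_crop(region: str) -> str:
    return f"[cropped region {region}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "image_search": image_search,
    "text_search": text_search,
    "image_crop": image_crop,
}


def scripted_policy(history: History) -> Step:
    """Stand-in for the model: crop, image search, text search, then answer."""
    plan = [
        Step("tool", "image_crop", "0.1,0.1,0.5,0.5"),
        Step("tool", "image_search", "cropped logo"),
        Step("tool", "text_search", "logo designer"),
        Step("answer", arg="(final answer)"),
    ]
    observations = sum(1 for role, _ in history if role == "observation")
    return plan[min(observations, len(plan) - 1)]


def run_agent(question: str, policy: Callable[[History], Step],
              max_turns: int = 6) -> History:
    """Alternate policy decisions and tool observations until an answer."""
    history: History = [("question", question)]
    for _ in range(max_turns):
        step = policy(history)
        if step.kind == "answer":
            history.append(("answer", step.arg))
            return history
        tool = TOOLS.get(step.name)
        observation = tool(step.arg) if tool else f"unknown tool {step.name}"
        history.append(("action", f"{step.name}({step.arg})"))
        history.append(("observation", observation))
    # Turn budget exhausted without the policy producing an answer.
    history.append(("answer", "no answer within turn budget"))
    return history


if __name__ == "__main__":
    for role, content in run_agent("Who designed this logo?", scripted_policy):
        print(f"{role}: {content}")
```

Each tool observation is appended to the history before the next decision, so later steps can condition on earlier results. That feedback is the sense in which tool use and reasoning are interleaved rather than issued as isolated calls.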

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications