VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning
Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia

TL;DR
VisionReasoner is a unified reinforcement learning framework that integrates reasoning into visual perception, improving on the Qwen2.5-VL baseline across detection, segmentation, and counting tasks without relying on annotated reasoning data.
Contribution
The paper introduces VisionReasoner, a novel unified model that combines reasoning with visual perception tasks using reinforcement learning and a shared reward mechanism.
Findings
Outperforms the Qwen2.5-VL baseline on multiple perception tasks
Achieves a 29.1% relative improvement over Qwen2.5-VL on COCO detection
Demonstrates faithful and reliable reasoning without annotated reasoning data
Abstract
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing a unified reward mechanism and multi-object cognitive learning strategies, VisionReasoner enhances its reasoning capabilities to analyze visual inputs and addresses diverse perception tasks within a unified model. VisionReasoner generates a structured reasoning process before delivering the desired outputs in response to user queries. Human evaluation reveals that the reasoning process of VisionReasoner is faithful and reliable even without annotated reasoning training data. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains:…
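The abstract does not spell out the reward design, but the reviews below mention format rewards, detection and point-matching rewards, and a GRPO objective. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. The `<think>`/`<answer>` template, the 0.5 IoU threshold, the additive combination, and all function names are assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical output template: reasoning in <think>, result in <answer>.
FORMAT_RE = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)


def format_reward(response):
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0


def total_reward(response, pred_box, gt_box, iou_threshold=0.5):
    """Assumed combination: format term plus a thresholded IoU term."""
    accuracy = 1.0 if box_iou(pred_box, gt_box) > iou_threshold else 0.0
    return format_reward(response) + accuracy


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: score each sampled response against
    the mean and standard deviation of its own group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, several responses sampled for the same query would each be scored with `total_reward`, and the resulting list passed to `group_relative_advantages`, so that a response is reinforced only to the extent that it beats the other samples in its group.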
Peer Reviews
Decision·ICLR 2026 Poster
1. The proposed framework is straightforward and easy to understand: RL is used to post-train a VLM for three core perception tasks, with multiple rewards covering the perception tasks and reasoning length. 2. The results show clear gains from RL post-training (using the GRPO objective) on appropriate benchmarks. 3. Experiments include ablations and analysis to understand the method.
1. The authors state that previous methods employ RL in a task-specific manner and utilize distinct reward functions for different tasks. However, in my opinion, the authors also seem to employ task-specific rewards for detection and point matching in addition to format rewards. 2. A zero-shot chain-of-thought prompted baseline should be present, as without it, it is currently unclear whether just directly prompting the model to think step-by-step or break down the prompt can also be suffic
1. The proposed architecture (planner + executor + UIR) offers a principled way to unify reasoning over different modalities and domains. 2. The system generalizes across unseen reasoning tasks and domains with minimal or no task-specific supervision.
1. The strongest recent baselines use retrieval-augmented reasoning where structured planning is implicit. The authors only compare with older systems like RAML and ReGrouP, not modern VLMs fine-tuned with in-context reasoning. 2. The planner is the backbone of VisionReasoner, yet the accuracy of program generation is not reported. The authors should consider including planner-only accuracy (e.g., execution success rate, semantic match with gold reasoning paths). 3. The UIR uses a limited set of
1. Unified multi-task framework design. The paper successfully constructs a unified framework capable of handling three major categories of visual perception tasks—detection, segmentation, and counting—simultaneously. This unified design offers several notable advantages. 2. Outstanding data efficiency and scalability. With only 7,000 training samples, the VisionReasoner-7B model achieves strong performance, demonstrating impressive data efficiency and generalization capability.
1. The experimental evaluation could benefit from a broader and more up-to-date set of baseline models. The paper mainly compares VisionReasoner with Shikra and Qwen2.5-VL; however, Shikra, as an early work from 2023, may not fully reflect the current progress of large vision-language models (LVLMs). Expanding the comparison to include more recent LVLMs could provide a fairer and more comprehensive assessment of VisionReasoner’s performance in the current landscape. 2. Some implementation aspec
Taxonomy
Topics: Visual Attention and Saliency Detection
