VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use
Zhehao Zhang, Ryan Rossi, Tong Yu, Franck Dernoncourt, Ruiyi Zhang, Jiuxiang Gu, Sungchul Kim, Xiang Chen, Zichao Wang, Nedim Lipka

TL;DR
VipAct introduces a multi-agent framework that significantly improves fine-grained visual perception in vision-language models by integrating specialized agents and tool use for detailed reasoning.
Contribution
The paper presents VipAct, a novel multi-agent system that enhances VLMs' visual perception through collaboration and specialized perceptual tools, addressing limitations in detailed pixel-level analysis.
Findings
Significant performance improvements on diverse visual perception benchmarks.
Multi-agent collaboration enhances detailed reasoning capabilities.
Ablation studies confirm the importance of tool use and planning in perception tasks.
Abstract
While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to better…
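The division of labor described in the abstract (an orchestrator that plans and coordinates, plus specialized agents and vision expert models that supply perceptual evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Orchestrator` class, the `caption_agent` and `depth_expert` functions, and the keyword-based planning rule are all hypothetical stand-ins. In the paper the orchestrator is itself a VLM, and the workers would call real models rather than return placeholder strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical worker signature: (image_path, question) -> textual evidence.
Worker = Callable[[str, str], str]


def caption_agent(image: str, question: str) -> str:
    """Stand-in for a specialized captioning agent (assumption, not the paper's code)."""
    return f"caption of {image}"


def depth_expert(image: str, question: str) -> str:
    """Stand-in for a vision expert model, e.g. a depth estimator (assumption)."""
    return f"depth summary for {image}"


@dataclass
class Orchestrator:
    """Toy orchestrator: plans which workers to call, then gathers their outputs."""
    workers: Dict[str, Worker] = field(default_factory=dict)

    def register(self, name: str, worker: Worker) -> None:
        self.workers[name] = worker

    def plan(self, question: str) -> List[str]:
        # Toy keyword rule standing in for VLM-driven task analysis and planning.
        steps = ["caption"]
        q = question.lower()
        if "closer" in q or "depth" in q:
            steps.append("depth")
        # Only keep steps for which a worker has been registered.
        return [s for s in steps if s in self.workers]

    def run(self, image: str, question: str) -> Dict[str, str]:
        # Call each planned worker and collect its evidence by name.
        return {name: self.workers[name](image, question)
                for name in self.plan(question)}


orchestrator = Orchestrator()
orchestrator.register("caption", caption_agent)
orchestrator.register("depth", depth_expert)
evidence = orchestrator.run("img.png", "Which point is closer to the camera?")
```

In a fuller system the collected `evidence` would be passed back to the orchestrator VLM together with the image, so that it can reason over the evidence and produce the final answer.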
Peer Reviews
Decision · Submitted to ICLR 2025
1. The VIPACT framework’s use of an orchestrator agent that coordinates with specialized agents and vision expert models stands out, as it improves VLMs' performance on fine-grained visual perception tasks by enabling collaborative reasoning. This structured, modular approach allows for flexibility and extensibility, making it adaptable for a wide range of tasks. 2. By employing System-2 reasoning, VIPACT goes beyond traditional VLM capabilities, integrating intermediate reasoning steps that are
1. VIPACT relies heavily on models like GPT-4o for their advanced instruction-following and function-calling abilities. While the framework is adaptable, current results may not generalize effectively to other VLMs lacking these specific capabilities, restricting the framework's accessibility and broader applicability. 2. On the MMVP benchmark, how does the proposed model compare with other vision foundation models such as LLaVA, InternVL, and Eagle? Eagle also uses multi-expert collaboration, ca
1. The paper is mostly well written and easy to follow. 2. The paper proposes a new multi-agent framework for fine-grained visual perception tasks. 3. The proposed framework is effective and improves performance on two datasets over existing baselines.
1. The framework is tested on only one LLM. Testing on more, e.g., Claude / Gemini, would be more convincing and show the generalization of the framework. 2. The method proposed has limited contribution to the community. There have been multiple papers proposing/applying "LLM with visual tool use" to solve vision tasks. The multi-agent framework has also been verified to be effective on various downstream tasks. 3. The performance of the proposed framework does not seem significant enough. On t
1. The authors used a VLM-based programmatic framework to perform difficult vision tasks that are challenging for most existing VLMs. 2. The description of the VipAct pipeline is clear and easy to understand. 3. The experiments are detailed and comprehensive.
1. It is still unclear how including the input image in the prompt to the orchestrator agent improves the fine-grained visual perception capability, although the experimental results show that removing this visual input leads to a performance decrease. At the beginning of the paper, the authors stated that "recent studies reveal that state-of-the-art (SOTA) VLMs continue to struggle with fine-grained or low-level visual perception tasks that are trivial for humans" (L43). Based on this premise,
Taxonomy
Topics: Tactile and Sensory Interactions · Data Visualization and Analytics · Retinal Imaging and Analysis
Methods: Sparse Evolutionary Training
