Coding Agents with Multimodal Browsing are Generalist Problem Solvers
Aditya Bharat Soni, Boxuan Li, Xingyao Wang, Valerie Chen, Graham Neubig

TL;DR
OpenHands-Versa is a generalist AI agent that uses a minimal set of general tools (code editing and execution, web search, and multimodal browsing) to achieve high performance across diverse benchmarks, surpassing specialized agents.
Contribution
This work introduces OpenHands-Versa, a generalist agent with a limited toolkit that outperforms specialized agents on multiple challenging benchmarks.
Findings
OpenHands-Versa outperforms previous best results on three benchmarks.
A minimal set of general tools can achieve high task performance.
Existing multi-agent systems lack generalization beyond their specific domains.
Abstract
Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file…
Taxonomy
Topics: Multi-Agent Systems and Negotiation
Methods: Sparse Evolutionary Training
