InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Bin Lei; Weitai Kang; Zijian Zhang; Winson Chen; Xi Xie; Shan Zuo; Mimi Xie; Ali Payani; Mingyi Hong; Yan Yan; Caiwen Ding

arXiv:2505.10887 · cs.AI · May 4, 2026

TL;DR

InfantAgent-Next is a multimodal generalist agent capable of interacting with computers across text, images, audio, and video, integrating tools and vision models in a modular architecture to solve diverse tasks.

Contribution

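The paper presents a modular multimodal agent architecture that integrates tool-based and pure vision agents, so that different models can collaboratively solve decoupled tasks step by step. The agent is evaluated on a pure vision-based benchmark (OSWorld) and on more general or tool-intensive benchmarks (GAIA and SWE-Bench).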

Findings

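01. Achieves 7.27% accuracy on the OSWorld benchmark, higher than Claude-Computer-Use.

02. Is evaluated on both a pure vision-based benchmark (OSWorld) and more general or tool-intensive benchmarks (GAIA and SWE-Bench).

03. Code and evaluation scripts are open-sourced on GitHub at https://github.com/bin123apple/InfantAgent.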

Abstract

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate on not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
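
The abstract describes different models collaboratively solving decoupled tasks step by step. A minimal sketch of that general idea follows, assuming a simple dispatch table. The names `Step`, `tool_agent`, `vision_agent`, `HANDLERS`, and `run_steps` are hypothetical and are not taken from the InfantAgent repository; the two agent functions are placeholders standing in for separate models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One decoupled sub-task (hypothetical structure, for illustration only)."""
    kind: str      # which kind of agent should handle it, e.g. "tool" or "vision"
    payload: str   # the instruction passed to that agent


def tool_agent(payload: str) -> str:
    # Placeholder for a tool-based agent (e.g. one that runs commands or edits files).
    return f"tool:{payload}"


def vision_agent(payload: str) -> str:
    # Placeholder for a pure vision agent (e.g. one that acts on screenshots).
    return f"vision:{payload}"


# Assumed registry mapping a step kind to the agent responsible for it.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "tool": tool_agent,
    "vision": vision_agent,
}


def run_steps(steps: List[Step], handlers: Dict[str, Callable[[str], str]]) -> List[str]:
    """Route each step, in order, to the handler registered for its kind."""
    results = []
    for step in steps:
        handler = handlers.get(step.kind)
        if handler is None:
            raise ValueError(f"no handler registered for step kind {step.kind!r}")
        results.append(handler(step.payload))
    return results


if __name__ == "__main__":
    plan = [Step("tool", "list files"), Step("vision", "click the OK button")]
    print(run_steps(plan, HANDLERS))
```

Keeping the handlers in a registry means a different model can be swapped in for one kind of step without touching the loop, which is one way to read the modularity the abstract claims; the actual system may be organized differently.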

Code & Models

Repositories

bin123apple/InfantAgent (GitHub): https://github.com/bin123apple/InfantAgent

Videos

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction (SlidesLive)