Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent
Yuhao Cheng, Liang Tang, Shuxian Li, Yukang Huo, Tiaonan Duan, Kaer Huang, Yanzhe Jing, Yiqiang Yan

TL;DR
This paper introduces the Self-Evolution Agent (SEA), a multi-modality large language model for autonomous computer operation, developed with an automatic data generation pipeline, Efficient Step-wise Reinforcement Learning, and a training-free model enhancement method.
Contribution
The paper presents SEA, a model developed with automatic generation of verifiable task trajectories, efficient step-wise reinforcement learning, and grounding and planning capabilities integrated into a single model, enabling effective computer use at a smaller parameter scale.
Findings
SEA outperforms similar-sized models on computer use tasks
SEA achieves performance comparable to larger models (32B/72B parameters)
Efficient step-wise reinforcement learning is introduced to reduce the computational overhead of long-horizon training
Abstract
Computer use agents represent an emerging area in artificial intelligence, aiming to operate computers autonomously to fulfill user tasks, attracting significant attention from both industry and academia. However, the performance of existing agents remains insufficient for practical deployment. In this paper, we propose the Self-Evolution Agent (SEA) for computer operation, alongside three core innovations in data generation, reinforcement learning, and model enhancement to develop this agent. Specifically, we first design an automatic pipeline to generate verifiable task trajectories for training. Second, we propose Efficient Step-wise Reinforcement Learning to reduce the substantial computational overhead of long-horizon training. Finally, we introduce a model enhancement method that integrates grounding and planning capabilities into a single model without additional training.…
