Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent

Yuhao Cheng; Liang Tang; Shuxian Li; Yukang Huo; Tiaonan Duan; Kaer Huang; Yanzhe Jing; Yiqiang Yan

arXiv:2508.04037 · cs.AI · January 23, 2026

TL;DR

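This paper introduces the Self-Evolution Agent (SEA), a multi-modality large language model for autonomous computer operation. SEA is built with three techniques: an automatic pipeline that generates verifiable task trajectories, Efficient Step-wise Reinforcement Learning for long-horizon training, and a model enhancement method that combines grounding and planning in a single model without additional training.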

Contribution

The paper presents SEA, a computer use agent developed with automatic data generation, Efficient Step-wise Reinforcement Learning, and training-free integration of grounding and planning. Together these let a model with fewer parameters handle computer use tasks effectively.

Findings

01 SEA outperforms models of similar size on computer use tasks.

02 SEA achieves performance comparable to larger models (32B/72B parameters).

03 Efficient Step-wise Reinforcement Learning reduces the computational overhead of long-horizon training.

Abstract

Computer use agents represent an emerging area in artificial intelligence, aiming to operate computers autonomously to fulfill user tasks, attracting significant attention from both industry and academia. However, the performance of existing agents remains insufficient for practical deployment. In this paper, we propose the Self-Evolution Agent (SEA) for computer operation, alongside three core innovations in data generation, reinforcement learning, and model enhancement to develop this agent. Specifically, we first design an automatic pipeline to generate verifiable task trajectories for training. Second, we propose Efficient Step-wise Reinforcement Learning to reduce the substantial computational overhead of long-horizon training. Finally, we introduce a model enhancement method that integrates grounding and planning capabilities into a single model without additional training.…
