TL;DR
OpenComputer is a verifier-grounded framework for building verifiable software environments for computer-use agents. It combines app-specific state verifiers, a self-evolving verification layer, a task-generation pipeline, and an evaluation harness that computes auditable partial-credit rewards.
Contribution
It introduces a verifier-grounded system for constructing and evaluating verifiable software worlds across 33 desktop applications.
Findings
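OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-based evaluation does.
OpenComputer covers 33 desktop applications and 1,000 finalized tasks, spanning browsers, office tools, creative software, development environments, file managers, and communication applications.
Open-source models score substantially lower on OpenComputer than on OSWorld-Verified.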
Abstract
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than…
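To make the idea of an auditable partial-credit reward concrete, here is a minimal sketch of how a set of machine-checkable conditions over application state could be scored. It is an illustration under stated assumptions, not OpenComputer's implementation: the `Check` class, the `partial_credit` function, the weighting scheme, and the example spreadsheet state are all hypothetical names and data introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Check:
    """One machine-checkable condition on the final application state (hypothetical)."""
    name: str
    predicate: Callable[[Dict[str, Any]], bool]
    weight: float = 1.0


def partial_credit(state: Dict[str, Any], checks: List[Check]) -> Dict[str, Any]:
    """Return the weighted fraction of passed checks plus a per-check audit trail."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    audit = []
    earned = 0.0
    for c in checks:
        try:
            passed = bool(c.predicate(state))
        except Exception:
            # A check that cannot be evaluated (e.g. missing field) counts as failed.
            passed = False
        audit.append({"check": c.name, "passed": passed, "weight": c.weight})
        if passed:
            earned += c.weight
    return {"reward": earned / total, "audit": audit}


# Assumed example: state as a verifier endpoint might report it for a spreadsheet task.
state = {"open_file": "budget.xlsx", "cells": {"B2": 1200, "B3": 300}, "saved": False}
checks = [
    Check("file_open", lambda s: s["open_file"] == "budget.xlsx"),
    Check("b2_value", lambda s: s["cells"]["B2"] == 1200),
    Check("saved", lambda s: s["saved"] is True, weight=2.0),
]
result = partial_credit(state, checks)
print(result["reward"])  # 0.5
```

Because the audit list records which checks passed and with what weight, the reward can be traced back to individual state conditions rather than to a single opaque score.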