Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu

TL;DR
This paper presents PC Agent-E, an efficient training framework for computer use agents that reduces dependence on large-scale human demonstrations by augmenting 312 human-annotated trajectories with alternative action decisions synthesized by Claude 3.7 Sonnet.
Contribution
Introduces PC Agent-E, a novel framework that combines limited human data with AI-generated trajectories to train superior computer use agents.
Findings
Achieved a 141% relative improvement over the baseline.
Surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2.
Reduced reliance on large-scale human demonstrations.
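Both headline numbers are relative changes, not absolute percentage-point gaps, and the two are easy to confuse. The sketch below shows the arithmetic. The scores in it are hypothetical values chosen only to make the calculation easy to follow; they are not results from the paper.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative change of `new` over `old`, expressed as a percentage."""
    if old <= 0:
        raise ValueError("old score must be positive")
    return (new - old) / old * 100.0


def absolute_gap(new: float, old: float) -> float:
    """Difference in percentage points between two success rates."""
    return new - old


# Hypothetical success rates (in %), for illustration only.
baseline_score = 10.0   # assumed score of an untrained base model
trained_score = 24.1    # assumed score after training
teacher_score = 30.0    # assumed score of a stronger reference model
student_score = 33.0    # assumed score of a model that edges past it

print(f"{relative_improvement(trained_score, baseline_score):.1f}% relative")
print(f"{relative_improvement(student_score, teacher_score):.1f}% relative, "
      f"{absolute_gap(student_score, teacher_score):.1f} points absolute")
```

With these assumed numbers, a 10% relative gain over a score of 30 is only 3 percentage points in absolute terms, so a claim should always state which of the two it reports.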
Abstract
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further augment them by synthesizing diverse alternative action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark we also released. By integrating robust human computer use skills with automated AI data synthesis capabilities, our method not only brought substantial improvements over training on human trajectories alone, but also significantly surpassed direct…
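The augmentation idea in the abstract (keep each human step, then attach alternative action decisions proposed by a stronger model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Step` and `Trajectory` types, the `propose` callback signature, the de-duplication rule, and the choice to build the history from the human actions are all hypothetical, and the stub stands in for a real model call.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    observation: str                  # e.g. a screenshot path or text description
    human_action: str                 # the action the human annotator took
    alternatives: List[str] = field(default_factory=list)


@dataclass
class Trajectory:
    task: str
    steps: List[Step]


# Hypothetical interface: (task, observation, history of prior actions) -> action
ProposeFn = Callable[[str, str, List[str]], str]


def augment(trajectory: Trajectory, propose: ProposeFn,
            n_alternatives: int = 3) -> Trajectory:
    """Return a copy of `trajectory` with alternative actions attached per step.

    Assumption: the history passed to the proposer follows the human path,
    so every alternative is conditioned on the same verified context.
    """
    history: List[str] = []
    new_steps: List[Step] = []
    for step in trajectory.steps:
        seen = {step.human_action}
        alternatives: List[str] = []
        for _ in range(n_alternatives):
            candidate = propose(trajectory.task, step.observation, history)
            if candidate not in seen:  # keep only distinct decisions
                seen.add(candidate)
                alternatives.append(candidate)
        new_steps.append(Step(step.observation, step.human_action, alternatives))
        history.append(step.human_action)
    return Trajectory(trajectory.task, new_steps)


def make_stub_proposer() -> ProposeFn:
    """Deterministic stand-in for a model call (purely illustrative)."""
    counter = itertools.count()

    def propose(task: str, observation: str, history: List[str]) -> str:
        return f"alt_action_{next(counter) % 2}"

    return propose


human = Trajectory(
    task="rename a file",
    steps=[Step("desktop", "open_explorer"), Step("explorer", "press_f2")],
)
augmented = augment(human, make_stub_proposer(), n_alternatives=3)
for step in augmented.steps:
    print(step.human_action, step.alternatives)
```

Returning a new `Trajectory` rather than mutating the input keeps the original human demonstrations intact, which makes it straightforward to compare training on human data alone against training on the augmented set.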
Peer Reviews
Decision: ICLR 2026 Poster
- Relevant problem of efficiently using small sets of expensive human data traces
- Evaluation only on a single and self-modified benchmark
- What about other benchmarks like OSWorld?
- Direct distillation from Claude not a fair baseline. It should be human data + direct distillation from Claude.
1. The proposed approach is a nice way to train performant agents in a limited-data regime, where we can collect small amounts of human demonstrations for each task and augment them synthetically to scale up demonstrations.
2. The additional contribution of a new evaluation benchmark, WindowsAgentArena-V2, is valuable to the community and helps improve the evaluation benchmarks overall by fixing existing issues.
3. The analysis presented in Section 5.3 is quite insightful. It highlights the fact t
1. The caption of Table 2 has a typo in the performance difference between the teacher model, Claude 3.7 Sonnet, and PC Agent-E. The text says the difference is 10% whereas it is ~4%. The authors need to fix this issue in the claims.
2. The test-time scaling results presented in Section 5.5 seem incomplete to me. For the main experiments in Table 1 the authors use a maximum step limit of 30, but in the test-time scaling experiments the two values used for the step limit are 15 and 30, which seems counterintuitive. I request the authors
* **Better benchmark signal (quality & significance).** Through human verification and fixes, the paper delivers a cleaner **WindowsAgentArena-V2** benchmark that reduces evaluation pathologies and provides clearer training/evaluation signals for computer-use agents.
* **Strong results with little data (originality & significance).** Despite limited human trajectories, the proposed training pipeline achieves **state-of-the-art performance on WindowsAgentArena**, allowing an open-source mode
* **Efficient training lacks out-of-domain generalization.** While the method reaches ~35% on WindowsAgentArena (on par with Claude 3.7), it drops to **14.9% on OSWorld**, whereas **Claude 3.7 is ~35%** under a 50-step cap. This gap indicates that the approach chiefly captures Windows-specific patterns rather than learning transferable skills, i.e., likely **in-domain overfitting** despite the “efficient training” claim.
Taxonomy
Topics: Multi-Agent Systems and Negotiation
