Programming with Pixels: Can Computer-Use Agents do Software Engineering?
Pranjal Aggarwal, Sean Welleck

TL;DR
This paper introduces Programming with Pixels (PwP), a comprehensive environment and benchmark for evaluating computer-use agents on complex software engineering tasks, revealing current limitations and potential improvements.
Contribution
It presents PwP and PwP-Bench, enabling holistic evaluation of CUAs on diverse SWE tasks and analyzing their performance with different API access levels.
Findings
Visual-only CUAs perform poorly on SWE tasks.
API access significantly improves CUA performance.
Additional IDE tools further enhance model capabilities.
Abstract
Computer-use agents (CUAs) hold the promise of performing a wide variety of general tasks, but current evaluations have primarily focused on simple scenarios. It therefore remains unclear whether such generalist agents can automate more sophisticated and specialized work such as software engineering (SWE). To investigate this, we introduce Programming with Pixels (PwP), the first comprehensive computer-use environment for software engineering, where agents visually control an IDE to perform diverse software engineering tasks. To enable holistic evaluation, we also introduce PwP-Bench, a benchmark of 15 existing and new software-engineering tasks spanning multiple modalities, programming languages, and skillsets. We perform an extensive evaluation of state-of-the-art open-weight and closed-weight CUAs and find that when interacting purely visually, they perform…
Peer Reviews
Decision: ICLR 2026 Poster
Addresses an important intersection between two major agent paradigms — IDE-based (CUA) and API-based (SWE) — by providing a unified testbed for fair comparison. Novel environment design: the first fully interactive VSCode-based setup enabling realistic software-engineering evaluation. Comprehensive benchmark: integrates 15 diverse SWE tasks across modalities and languages. Insightful analysis: identifies visual grounding and tool-usage limitations as key failure modes, offering concrete dire
Only one specialized SWE baseline (Mini-SWEAgent) is included; broader baselines would strengthen comparisons. Conceptual clarity: definitions of computer-use agent vs SWE agent and the rationale for “fair evaluation” (why expressiveness and access matter) could be better explained. PwP-Bench-Lite covers many task types but has very few instances (20 per task), and the paper reports no repeated trials or variance, raising robustness concerns. Technically, the benchmark adapts existing dataset
1. The provided benchmark is very valuable, and its container-based approach makes evaluations consistent.
2. The provided VSCode IDE that returns screenshots with set-of-marks and DOM is valuable and can be used in future research.
3. The evaluation of various computer-use agents is comprehensive.
4. The ablation studies are also comprehensive, and the results on how much the text-based interface vs. the visual interface is used are very interesting.
1. To measure the importance of the visual interface, it would be interesting to measure the performance when agents only have access to text-based tools (file edit, terminal, and task-specific tools). It would show the computer-use agents' ability in text-only settings.
2. Some tasks aren’t relevant to agentic software development. For example, changing the theme of the IDE is not that important, while similar IDE setting changes, like turning on auto-suggestion, could be helpful.
3. Some deep
The PwP environment addresses a critical gap in evaluating computer-use agents for software engineering tasks. By providing a realistic IDE interface through VSCode, the environment enables comprehensive testing of agent capabilities while maintaining the expressiveness needed for diverse software engineering activities. The extensible design, with features like checkpointing, multimodal support, and easy benchmark addition, positions it as a lasting contribution to the research community. The
The following are key issues identified in the work:
- The reliance on PwP-Bench-Lite (300 instances) due to computational constraints creates a risk of sampling bias and may miss important edge cases that would emerge in full-scale evaluation. The 20-step limitation, while computationally necessary, may artificially constrain performance on genuinely complex software engineering tasks, though the supplementary 250-step experiments provide partial validation.
- The heavy focus on Claude
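The reviewers' concern about 20 instances per task with no repeated trials can be made concrete with a confidence interval on a per-task success rate. The sketch below is illustrative only and is not taken from the paper: it assumes each instance is an independent pass/fail outcome and uses the standard Wilson score interval; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption (not from the paper): each benchmark instance is an
    independent pass/fail trial. z = 1.96 gives a ~95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical example: an agent solves 10 of the 20 instances in one task.
    low, high = wilson_interval(10, 20)
    print(f"10/20 solved: 95% interval = [{low:.2f}, {high:.2f}]")
```

Under these assumptions, a task where an agent solves 10 of 20 instances has an interval of roughly 0.30 to 0.70, so per-task differences between agents of a few instances are hard to distinguish from noise without repeated trials or more instances.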
Taxonomy
Topics: Software Engineering Research · Advanced Software Engineering Methodologies · Multi-Agent Systems and Negotiation
