RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun

TL;DR
RedTeamCUA introduces a hybrid sandbox framework for realistic adversarial testing of computer-use agents, revealing significant vulnerabilities and emphasizing the need for improved defenses against prompt injection attacks.
Contribution
It presents RedTeamCUA, a novel hybrid sandbox and benchmark for systematic evaluation of CUA vulnerabilities in web-OS environments, addressing limitations of prior testing methods.
Findings
CUAs show high attack success rates, up to 60%.
Current CUAs attempt adversarial tasks at high rates.
Vulnerabilities pose tangible risks to users and systems.
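The findings above rest on two rates: how often an agent attempts the injected task, and how often it completes it. The sketch below shows one way these could be computed from per-example outcomes. The `EpisodeResult` record and its two boolean fields are hypothetical names introduced for illustration, not the paper's evaluation code, and counting every success as an attempt is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class EpisodeResult:
    """Hypothetical per-example outcome record (not from the paper)."""
    attempted: bool  # the agent started pursuing the injected goal
    succeeded: bool  # the injected goal was fully carried out


def attack_rates(results: Iterable[EpisodeResult]) -> Tuple[float, float]:
    """Return (attempt_rate, attack_success_rate) over all examples."""
    results = list(results)
    if not results:
        raise ValueError("need at least one result")
    # Assumption: a successful attack always counts as an attempt.
    attempts = sum(1 for r in results if r.attempted or r.succeeded)
    successes = sum(1 for r in results if r.succeeded)
    return attempts / len(results), successes / len(results)


demo = [
    EpisodeResult(attempted=True, succeeded=True),
    EpisodeResult(attempted=True, succeeded=False),
    EpisodeResult(attempted=False, succeeded=False),
    EpisodeResult(attempted=True, succeeded=False),
]
attempt_rate, success_rate = attack_rates(demo)
print(attempt_rate, success_rate)  # 0.75 0.25
```

Reporting both numbers separates agents that resist an injection from agents that follow it but lack the capability to finish the task.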
Abstract
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS…
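According to the peer reviews, the 864 examples come from a cross-product of benign tasks, injections, and instantiations, with 9 benign tasks. A minimal sketch of that construction follows. Only the count of 9 benign tasks and the total of 864 are taken from this page; the split of the remaining factor of 96 into 24 injections and 4 instantiations is an assumption chosen so the product matches, and all identifiers are placeholders.

```python
from itertools import product

# 9 benign tasks is stated in the reviews; the names are placeholders.
benign_tasks = [f"benign_{i}" for i in range(9)]

# ASSUMPTION: 24 injections x 4 instantiations is one split that yields
# 864 in total. This page does not state these two counts.
injections = [f"injection_{i}" for i in range(24)]
instantiations = [f"instantiation_{i}" for i in range(4)]

# Every benchmark example pairs one item from each of the three axes.
examples = [
    {"benign": b, "injection": a, "instantiation": v}
    for b, a, v in product(benign_tasks, injections, instantiations)
]

print(len(examples))  # 864
```

Because each benign task is reused across all 96 adversarial combinations, utility estimates still rest on only 9 distinct tasks, which is the concern the review raises.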
Peer Reviews
Decision: ICLR 2026 Oral
- Proposes a well-designed hybrid sandbox integrating web and OS layers, bridging realism and safety in adversarial testing.
- Builds a large-scale, systematic benchmark (RTC-Bench) grounded in realistic tasks and security principles (CIA triad).
- Provides comprehensive empirical results with both execution-based and LLM-judge metrics, revealing concrete weaknesses in current frontier CUAs.
- Conducts thoughtful analysis comparing adapted LLM agents vs. specialized CUAs, and offers valuable

- Closed-source CUAs (GPT-4o, Claude 3.5/3.7 Sonnet, Claude 4 Opus, Operator) dominate the study; no strong open-source CUAs (e.g., UI-TARS 2, OpenCUA) are included, limiting reproducibility and community relevance. More closed-source and open-source CUAs should be included.
- The defense evaluation is superficial: existing methods are merely tested rather than extended or improved.
- Provides limited mechanistic analysis of why specific CUAs succumb to injection (e.g., rea

- The OSWorld backbone allows for hybrid attacks over both OS and web.
- Realistic threat model.
- The decoupled evaluation setting is good for helping weaker models reach the point of prompt injection, though it might be somewhat unnatural depending on how the tool-calling traces are hard-coded into the agent history.
- Great to have both Web -> OS -> Web and Web -> OS adversarial scenarios.

- The large number of 864 examples is only achieved by a cross-product of benign, injection, and instantiation. The number of benign tasks (9) might be too small to accurately estimate agent utility, which is critical in any security benchmark (otherwise a useless agent might have perfect security).
- I disagree with the assessment that the DoomArena threat model requires full webpage control; even though the authors note that the banners and pop-up attacks are injected into the web page (

- Well-written and well-designed benchmark.
- Addresses an important niche of attacks on CUAs and hybrid models.
- Interesting findings on the limitations of current defense frameworks in this regime.

- Novelty: The method seems to build on top of and combine existing benchmarks and attacks. In a future version, I would like to see the authors spell out in more detail how their benchmark differs.
- Attacks: Only one format of attack was evaluated. It would be nice to see more evaluated.
- Sanitization: How sensitive are results to realistic UI noise (extra messages/files)?
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
