Human-Guided Harm Recovery for Computer Use Agents
Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

TL;DR
This paper introduces a human-aligned harm recovery framework for computer use agents, combining a user study of recovery preferences, a new benchmark, and reward modeling to improve post-harm remediation.
Contribution
It formalizes harm recovery as a post-execution safeguard, develops a dataset and reward model for preference-aligned recovery, and introduces BackBench, a benchmark for evaluating recovery from harmful states.
Findings
The reward model improves recovery quality over base agents.
User preferences influence recovery strategies and priorities.
BackBench enables systematic evaluation of harm recovery methods.
Abstract
As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at…
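The re-ranking step described in the abstract can be illustrated with a minimal sketch: an agent proposes several candidate recovery plans, a reward function scores each one, and the highest-scoring plan is selected. Everything named here is an assumption made for illustration: `RecoveryPlan`, `rerank`, and `toy_reward` are hypothetical, and `toy_reward` is a crude placeholder that prefers plans with fewer steps. It is not the paper's reward model, only a stand-in loosely echoing the reported preference for pragmatic, targeted strategies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class RecoveryPlan:
    """A candidate recovery plan proposed by an agent (hypothetical structure)."""
    description: str
    steps: Tuple[str, ...]


def rerank(
    plans: Sequence[RecoveryPlan],
    reward_fn: Callable[[RecoveryPlan], float],
) -> List[RecoveryPlan]:
    """Return the plans sorted from highest to lowest reward."""
    return sorted(plans, key=reward_fn, reverse=True)


def toy_reward(plan: RecoveryPlan) -> float:
    """Placeholder scorer, NOT the paper's reward model.

    Assumption for illustration only: fewer steps means a more targeted plan.
    A learned reward model would replace this function.
    """
    return -float(len(plan.steps))


# Hypothetical candidates for a made-up scenario (an accidentally deleted file).
candidates = [
    RecoveryPlan(
        description="audit the whole system",
        steps=("inventory files", "diff against backup", "restore file", "write report"),
    ),
    RecoveryPlan(
        description="restore the deleted file",
        steps=("locate backup", "restore file"),
    ),
]

ranked = rerank(candidates, toy_reward)
best = ranked[0]
print(best.description)
```

Passing the scorer in as `reward_fn` keeps the selection logic independent of how plans are scored, so a learned preference model could be substituted without changing `rerank`.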
