ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

TL;DR
ReVision introduces a learned patch selector to reduce visual redundancy in graphical user interface trajectories, significantly decreasing token usage and enabling longer history processing for computer-use agents.
Contribution
ReVision presents a method for reducing visual redundancy that improves the efficiency and performance of multimodal language models used as GUI-based agents.
Findings
ReVision reduces token usage by approximately 46% on average.
Success rate improves by 3% over the no-drop baseline.
Performance continues to improve with more past observations when redundancy is removed.
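To make the token-budget argument concrete, here is a minimal arithmetic sketch. It assumes a hypothetical cost of 1,000 visual tokens per screenshot and treats the reported average reduction of roughly 46% as a uniform per-screenshot saving. Both are simplifications for illustration, not figures or behavior taken from the paper, and the function names are made up for this example.

```python
def tokens_for_history(num_screenshots, tokens_per_screenshot, reduction=0.0):
    """Total visual tokens for a history, given a fractional reduction in [0, 1)."""
    return round(num_screenshots * tokens_per_screenshot * (1.0 - reduction))


def screenshots_within_budget(budget, tokens_per_screenshot, reduction=0.0):
    """How many whole screenshots fit in a fixed visual-token budget."""
    effective_cost = tokens_per_screenshot * (1.0 - reduction)
    return int(budget // effective_cost)


# Hypothetical per-screenshot cost; the real value depends on model and resolution.
TOKENS_PER_SCREENSHOT = 1000
REDUCTION = 0.46  # the ~46% average from the findings, applied uniformly (assumption)

baseline = tokens_for_history(5, TOKENS_PER_SCREENSHOT)
reduced = tokens_for_history(5, TOKENS_PER_SCREENSHOT, REDUCTION)
fits = screenshots_within_budget(baseline, TOKENS_PER_SCREENSHOT, REDUCTION)

print(baseline, reduced, fits)
```

Under these assumptions, the budget that holds 5 unreduced screenshots would hold 9 reduced ones, which is the sense in which redundancy removal leaves room for longer history.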
Abstract
Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, and unlike in other domains, adding history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant ones while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B,…
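The idea of comparing patch representations across consecutive screenshots can be illustrated with a small sketch. This is not the paper's method: ReVision uses a learned selector, whereas the code below substitutes a fixed cosine-similarity threshold as a stand-in. The function names, the threshold value, and the toy two-dimensional embeddings are all hypothetical. Kept patches carry their original grid index, which is one simple way to avoid losing spatial position when patches are dropped.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_mask(prev_patches, curr_patches, threshold=0.95):
    """One bool per patch position: True = keep, False = redundant with the previous frame.

    A fixed threshold stands in for the learned selector (assumption).
    """
    if len(prev_patches) != len(curr_patches):
        raise ValueError("both screenshots must have the same number of patches")
    return [cosine(p, c) < threshold for p, c in zip(prev_patches, curr_patches)]


def select_patches(curr_patches, mask):
    """Return (grid_index, embedding) pairs for kept patches, preserving position."""
    return [(i, patch) for i, (patch, keep) in enumerate(zip(curr_patches, mask)) if keep]


# Toy embeddings for a 3-patch screenshot; only the middle patch changes.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

mask = keep_mask(prev_frame, curr_frame)
kept = select_patches(curr_frame, mask)
print(mask)
print(kept)
```

In this toy case only the changed middle patch is kept, together with its grid index 1, so a downstream model could still be told where on the screen that patch belongs.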
