ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Amirhossein Abaskohi; Yuhang He; Peter West; Giuseppe Carenini; Pranit Chawla; Vibhav Vineet

arXiv:2605.11212·cs.CL·May 14, 2026

TL;DR

ReVision introduces a learned patch selector to reduce visual redundancy in graphical user interface trajectories, significantly decreasing token usage and enabling longer history processing for computer-use agents.

Contribution

The paper presents a method that removes redundant visual patches across consecutive screenshots, improving the efficiency and performance of multimodal language models used as GUI-based agents.

Findings

01

ReVision reduces token usage by approximately 46% on average.

02

Success rate improves by 3% over the no-drop baseline.

03

Performance continues to improve with more past observations when redundancy is removed.
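
To make the link between the token reduction and longer history concrete, here is a rough back-of-the-envelope sketch. The function name `history_capacity`, the 16,000-token budget, and the 1,000 tokens per screenshot are hypothetical values chosen for illustration; only the 46% average reduction comes from the findings above, and the sketch assumes (simplistically) that the reduction applies uniformly to every screenshot.

```python
def history_capacity(budget_tokens, tokens_per_screenshot, reduction=0.0):
    """Whole screenshots that fit in a fixed visual-token budget.

    `reduction` is the fraction of visual tokens removed per screenshot
    (assumed uniform here, which real trajectories will not be).
    """
    effective = tokens_per_screenshot * (1.0 - reduction)
    return int(budget_tokens // effective)


# Hypothetical budget and per-screenshot cost, for illustration only.
baseline = history_capacity(16_000, 1_000)
reduced = history_capacity(16_000, 1_000, reduction=0.46)
print(baseline, reduced)
```

Under these assumed numbers, the same budget holds roughly 1 / (1 - 0.46) ≈ 1.85 times as many screenshots once redundant tokens are dropped.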

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, using history has yielded little or no performance improvement, unlike in other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B,…
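
The idea of dropping patches that repeat across consecutive screenshots can be illustrated with a minimal sketch. This is not the paper's learned patch selector: it swaps in a simple pixel-difference threshold as a stand-in, and the names `patchify`, `keep_mask`, `select_patches`, the patch size of 14, and the tolerance `tol` are all assumptions made for illustration. Kept patches are returned together with their (row, col) grid positions, which is one plausible way to retain spatial structure after dropping.

```python
import numpy as np


def patchify(img, patch):
    """Split an (H, W, C) image into a (H//patch, W//patch, patch*patch*C) grid."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    img = img[: gh * patch, : gw * patch]  # crop to a whole number of patches
    grid = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh, gw, patch * patch * c)


def keep_mask(prev, curr, patch=14, tol=0.0):
    """Boolean (gh, gw) mask, True where a patch differs from the previous frame.

    Heuristic stand-in: mean absolute pixel difference above `tol`.
    """
    a = patchify(prev.astype(np.float32), patch)
    b = patchify(curr.astype(np.float32), patch)
    return np.abs(a - b).mean(axis=-1) > tol


def select_patches(prev, curr, patch=14, tol=0.0):
    """Return the kept patches of `curr`, their grid positions, and the mask."""
    mask = keep_mask(prev, curr, patch, tol)
    rows, cols = np.nonzero(mask)
    patches = patchify(curr, patch)[rows, cols]
    positions = np.stack([rows, cols], axis=1)  # (row, col) index per kept patch
    return patches, positions, mask


# Toy example: two 28x42 "screenshots" that differ in a single 14x14 region.
prev = np.zeros((28, 42, 3), dtype=np.uint8)
curr = prev.copy()
curr[0:14, 14:28] = 255
patches, positions, mask = select_patches(prev, curr)
print(patches.shape, positions.tolist(), 1.0 - mask.mean())
```

In this toy case only one of the six patches changes, so five are dropped and the single kept patch carries its grid position with it. A learned selector, as described in the abstract, would compare patch representations rather than raw pixels.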
