Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems
Maria Movin, Claudia Hauff, Aron Henriksson, Panagiotis Papapetrou

TL;DR
This paper introduces a trace-level framework for comparing human and GUI-agent behavior in production search systems. In the reported study, humans and the agent reach similar outcomes but navigate differently.
Contribution
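The paper presents a novel evaluation framework that compares behavioral traces along three dimensions: task outcome and effort, query formulation, and navigation across interface states. This exposes behavioral differences that task success rates alone do not show.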
Findings
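The agent achieves task success comparable to that of the human participants.
The agent generates queries broadly aligned with those of the participants.
The agent follows systematically different navigation strategies: it is more search-centric, while participants show content-centric, exploratory behavior.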
Abstract
LLM-driven GUI agents are increasingly used in production systems to automate workflows and simulate users for evaluation and optimization. Yet most GUI-agent evaluations emphasize task success and provide limited evidence on whether agents interact in human-like ways. We present a trace-level evaluation framework that compares human and agent behavior across (i) task outcome and effort, (ii) query formulation, and (iii) navigation across interface states. We instantiate the framework in a controlled study in a production audio-streaming search application, where 39 participants and a state-of-the-art GUI agent perform ten multi-hop search tasks. The agent achieves task success comparable to participants and generates broadly aligned queries, but follows systematically different navigation strategies: participants exhibit content-centric, exploratory behavior, while the agent is more…
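One way to make the navigation comparison concrete is sketched below. It is an illustrative assumption, not the authors' method: each trace is treated as a sequence of interface states, the frequency of state-to-state transitions is estimated for each group, and the two distributions are compared with Jensen-Shannon divergence. The state names, the example traces, and the choice of divergence are all hypothetical.

```python
from collections import Counter
from math import log2


def transition_distribution(traces):
    """Relative frequency of consecutive state pairs across all traces."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, bounded in [0, 1]."""
    keys = set(p) | set(q)
    # Mixture distribution used as the reference for both KL terms.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical traces: the state labels are made up for illustration.
human_traces = [
    ["search", "results", "album", "artist", "album"],
    ["search", "results", "playlist"],
]
agent_traces = [
    ["search", "results", "search", "results", "album"],
    ["search", "results", "search", "results"],
]

human = transition_distribution(human_traces)
agent = transition_distribution(agent_traces)
divergence = js_divergence(human, agent)
print(f"JS divergence between transition distributions: {divergence:.3f}")
```

A value of 0 means the two groups take identical transitions with identical frequencies, and 1 means they share no transitions. In this toy data the agent's repeated `results` to `search` transition, which the human traces lack, is what pushes the value above 0.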
