See, Think, Act: Online Shopper Behavior Simulation with VLM Agents
Yimeng Zhang, Jiri Gesi, Ran Xue, Tian Wang, Ziyi Wang, Yuxuan Lu, Sinong Zhan, Huimin Zeng, Qingjun Cui, Yufan Guo, Jing Huang, Mubarak Shah, Dakuo Wang

TL;DR
This paper enhances online shopper behavior simulation by integrating visual webpage data with language models, significantly improving prediction accuracy and realism in complex, visually rich environments.
Contribution
It introduces a multi-modal approach that combines visual and textual data for behavior simulation and applies hierarchical reinforcement learning, advancing beyond text-only models.
Findings
Visual grounding improves accuracy by over 6%.
Multi-modal inputs enhance simulation fidelity.
Hierarchical RL prioritizes challenging decisions.
Abstract
LLMs have recently demonstrated strong potential in simulating online shopper behavior. Prior work has improved action prediction by applying SFT on action traces with LLM-generated rationales, and by leveraging RL to further enhance reasoning capabilities. Despite these advances, current approaches rely on text-based inputs and overlook the essential role of visual perception in shaping human decision-making during web GUI interactions. In this paper, we investigate the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs, leveraging the OPeRA dataset. By grounding agent decision-making in both textual and visual modalities, we aim to narrow the gap between synthetic agents and real-world users, thereby enabling more cognitively aligned simulations of online shopping behavior. Specifically, we employ SFT for joint action prediction and…
Peer Reviews
Decision · Submitted to ICLR 2026
(1) Thoughtful reward design for both fine and coarse-grained tasks, with structured and difficulty-aware rewards. (2) Comprehensive evaluation metrics including exact match and F1. (3) Strong text–vision alignment design, e.g., “We further prune the HTML structure by retaining only elements visible within the current viewport, reducing noise and aligning textual and visual modalities.” which makes a clear integration of visual context that improves grounding and simulation realism. (4) Compr
(1) Inconsistent abbreviation use: some acronyms are redefined after being introduced earlier. (2) Limited real human rationales make the “human-aligned” outputs only partially aligned with actual human reasoning. How limited is the number of true human rationales? (3) The joint prediction formulation (rationale + action) is questionable since many rationales are synthetic, potentially weakening alignment between reasoning and behavior. (4) The statement: “Rationale generation and action pred
1. Timely and well-motivated research direction: Incorporating visual perception into behavior simulation addresses a real gap in existing text-only approaches; GUI agents and WebNav agents are quite trendy. 2. Reasonable experimental setup: The paper provides a clear comparison across modalities (text+image, text-only, image-only) and training schemes (zero-shot, SFT, SFT+RL). 3. Honest discussion of limitations: Section 5 provides valuable reflections on methodological constraints, including actio
In general, the novelty of this paper is very weak compared with existing VLM-based agent work. If you look at the Online-Mind2Web and WebArena leaderboards, all of these agents adopt a very similar training pipeline. The authors didn't really understand or solve the bottleneck in this area. 1. Severely limited dataset scale and generalizability: The paper uses only 692 sessions from 51 users, resulting in just 8,212 training samples and 1,508 test samples after splitting (Table 1). This
Strengths:
- interesting, useful, and timely problem (behavior simulation)
- positive improvement on a benchmark
- well-written and clearly presented paper
Primary weakness: limited contribution and lack of depth. Given that the OPeRA dataset already includes screenshots, the main addition of this paper is preprocessing and filtering, which may be useful as practical implementation details but, as presented, doesn’t offer new insight. It’s quite obvious that adding visual information would help, so the scientific contribution should come from understanding how and why, not just showing a small improvement. I’m not opposed to papers that pursue an “
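One review above quotes the paper's pruning step: "retaining only elements visible within the current viewport." The sketch below illustrates one plausible reading of that step, assuming each element has already been flattened into a record with a rendered bounding box. The `Box` and `Element` types, the `prune_to_viewport` helper, and the sample page are all hypothetical and are not taken from the paper or the OPeRA dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (pixels). Hypothetical type."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def intersects(self, other: "Box") -> bool:
        # Zero-area boxes (hidden or collapsed elements) never count as visible.
        if self.width <= 0 or self.height <= 0:
            return False
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass(frozen=True)
class Element:
    """Hypothetical flattened HTML element with its rendered bounding box."""
    tag: str
    text: str
    box: Box


def prune_to_viewport(elements: List[Element], viewport: Box) -> List[Element]:
    """Keep only elements whose box overlaps the viewport, in document order."""
    return [el for el in elements if el.box.intersects(viewport)]


# Assumed example: a 1280x720 viewport scrolled 600 px down the page.
viewport = Box(left=0, top=600, width=1280, height=720)
page = [
    Element("h1", "Wireless Mouse", Box(40, 100, 600, 48)),        # above the viewport
    Element("button", "Add to Cart", Box(900, 700, 200, 40)),      # fully inside
    Element("div", "Customer reviews", Box(40, 1300, 1200, 400)),  # partly inside
    Element("footer", "Contact us", Box(0, 4000, 1280, 200)),      # far below
    Element("span", "", Box(100, 650, 0, 0)),                      # zero-area, hidden
]
visible = prune_to_viewport(page, viewport)
visible_tags = [el.tag for el in visible]
print(visible_tags)
```

Keeping partially visible elements is a design choice made here for illustration; a stricter variant could require full containment. Either way, the retained text would describe roughly what the screenshot shows.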
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Artificial Intelligence in Games
