SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Han Li; Vibhor Malik; Zahra Zanjani Foumani; Alberto Castelo; Shuang Xie; Ailin Fan; Keat Yang Koay; Yuanzheng Zhu; Meysam Feghhi; Ronie Uliana; Zhaoyu Zhang; Angelo Ocana Martins; Mingyu Zhao; Francis Pelland; Jonathan Faerman; Nikolas LeBlanc; Aaron Glazer; Andrew McNamara; Zhong Wu; Lingyun Wang

arXiv:2605.19219·cs.AI·May 20, 2026

TL;DR

SimGym is a framework that simulates A/B testing in e-commerce using vision-language model agents, enabling rapid, traffic-grounded evaluation of UI changes without affecting real users.

Contribution

It introduces a simulation framework that pairs traffic-grounded personas with live-browser agents to predict how real buyers respond to UI modifications.

Findings

1. Achieves 77% directional alignment with real buyer behavior in A/B tests.

2. Reduces experimental cycle time from weeks to under an hour.

3. Validates simulation accuracy across diverse storefronts and product categories.
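
The directional-alignment figure in the first finding can be illustrated with a small sketch. This excerpt does not give the paper's exact metric definition, so the code below assumes the simplest reading: the fraction of A/B tests in which the simulated outcome shift (treatment minus control) has the same sign as the shift observed with real buyers. The function name `directional_alignment` and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def _sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def directional_alignment(
    simulated: Sequence[float], observed: Sequence[float]
) -> float:
    """Fraction of experiments where simulated and observed shifts share a sign.

    Assumption: each entry is a treatment-minus-control shift in some outcome
    metric for one A/B test. This is an illustrative definition, not
    necessarily the one used in the paper.
    """
    if len(simulated) != len(observed):
        raise ValueError("simulated and observed must have the same length")
    if not simulated:
        raise ValueError("at least one experiment is required")
    matches = sum(_sign(s) == _sign(o) for s, o in zip(simulated, observed))
    return matches / len(simulated)


# Hypothetical shifts for four A/B tests (made-up illustration data).
simulated_shifts = [0.012, -0.004, 0.008, -0.010]
observed_shifts = [0.020, -0.001, -0.003, -0.006]

alignment = directional_alignment(simulated_shifts, observed_shifts)
print(f"Directional alignment: {alignment:.0%}")  # 3 of 4 signs agree
```

A sign-agreement measure like this ignores the magnitude of each shift, so it says whether the simulation points the same way as real traffic, not how large the effect is.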

Abstract

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We…
