RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

Renqi Chen; Zeyin Tao; Jianming Guo; Jing Wang; Zezhou Xu; Jingzhe Zhu; Qingqing Sun; Tianyi Zhang; Shuai Chen

arXiv:2604.13531·cs.AI·April 16, 2026

TL;DR

RiskWebWorld is a realistic benchmark designed to evaluate GUI agents in high-stakes e-commerce risk management, highlighting current performance gaps and supporting reinforcement learning improvements.

Contribution

We introduce RiskWebWorld, the first comprehensive interactive benchmark for GUI agents in e-commerce risk management, with a scalable infrastructure and evaluation of diverse models.

Findings

01. Top-tier models achieve a 49.1% success rate.

02. Specialized open-weights models fail almost completely.

03. Reinforcement learning improves open-source models by 16.2%.

Abstract

Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, including partial environmental hijacking. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across…
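The decoupling of policy planning from environment mechanics mentioned in the abstract can be illustrated with a minimal sketch. It is not the RiskWebWorld infrastructure: the class `ToyRiskReviewEnv`, the page names, the action dictionaries, and the reward rule are all hypothetical. The sketch only mirrors the Gymnasium calling convention (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, info) in plain Python, so it runs without the `gymnasium` package installed.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Observation = Dict[str, Any]
Action = Dict[str, Any]


class ToyRiskReviewEnv:
    """Hypothetical toy environment that follows Gymnasium's reset/step convention.

    The environment owns all page state and scoring; the policy sees only observations.
    """

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self._steps = 0
        self._page = "listing"

    def _observe(self) -> Observation:
        return {"page": self._page, "steps": self._steps}

    def reset(
        self, *, seed: Optional[int] = None, options: Optional[dict] = None
    ) -> Tuple[Observation, dict]:
        # seed and options are accepted for interface compatibility; this toy is deterministic.
        self._steps = 0
        self._page = "listing"
        return self._observe(), {}

    def step(self, action: Action) -> Tuple[Observation, float, bool, bool, dict]:
        self._steps += 1
        reward = 0.0
        terminated = False
        if action.get("type") == "click" and action.get("target") == "seller_profile":
            # Navigation is an environment concern, not a policy concern.
            self._page = "seller_profile"
        elif action.get("type") == "submit":
            # Assumed success rule: flag the seller after opening the profile page.
            terminated = True
            if self._page == "seller_profile" and action.get("verdict") == "flag":
                reward = 1.0
        truncated = (not terminated) and self._steps >= self.max_steps
        return self._observe(), reward, terminated, truncated, {}


def scripted_policy(obs: Observation) -> Action:
    """Stand-in for a GUI agent: maps an observation to the next action."""
    if obs["page"] == "listing":
        return {"type": "click", "target": "seller_profile"}
    return {"type": "submit", "verdict": "flag"}


def run_episode(env: ToyRiskReviewEnv, policy: Callable[[Observation], Action]) -> float:
    """Generic rollout loop; any policy with the same signature can be swapped in."""
    obs, _info = env.reset(seed=0)
    total = 0.0
    done = False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total
```

Because the rollout loop touches only `reset` and `step`, the same loop could serve both evaluation and RL data collection: swapping `scripted_policy` for a model-backed policy requires no change to the environment.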
