DeepShop: A Benchmark for Deep Research Shopping Agents

Yougang Lyu; Xiaoyu Zhang; Lingyong Yan; Maarten de Rijke; Zhaochun Ren; Xiuying Chen

arXiv:2506.02839·cs.IR·June 4, 2025

Open Access · 3 Reviews

TL;DR

DeepShop introduces a comprehensive benchmark for evaluating web shopping agents in complex, realistic scenarios involving diverse queries, product attributes, filters, and sorting, highlighting current methods' limitations.

Contribution

This paper presents DeepShop, a novel benchmark that simulates real-world shopping complexity for evaluating and advancing web shopping agents.

Findings

01

RAG methods struggle with complex queries due to lack of web interaction.

02

Web agents face challenges with filters and sorting preferences.

03

Overall success rates remain low across evaluated methods.

Abstract

Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. However, existing benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as "Find iPhone 15." Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper identifies a significant limitation in existing web agent benchmarks. Their focus on simple, deterministic tasks does not adequately test an agent's ability to handle the complex, multi-constraint queries that are common in real-world e-commerce. The proposed benchmark is on the right track toward bridging this gap by evaluating more advanced agents that have real-world application value.

2. The proposed fine-grained evaluation framework is a key strength. It allows researchers to move bey

Weaknesses

1. The paper lacks crucial details on environment and reproducibility. While offline benchmarks are static and simpler, they offer a unique advantage in controllability and reproducibility over online environments. A core challenge of an online benchmark is the dynamic nature of websites (e.g., prices change, review counts increase, items go out of stock). The paper does not adequately address how the benchmark's ground truth is maintained. How can an agent's success on a query like "at least 300

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. Overall, the paper is pretty easy to read and understand.

2. As far as I can tell, the task has a reasonable amount of headroom (cf. Table 2), suggesting that this benchmark might stay relevant for longer than more synthetic alternatives like WebShop.

Weaknesses

1. The paper may be of limited interest to researchers who are not specifically interested in web shopping agents. Much of its analysis is about, e.g., differences in agent performance across different shopping domains (e.g., books vs. fashion), and doesn’t seem applicable to web navigation or agent research more broadly. My impression is that the tasks described here are roughly a subset of those described in more general-purpose benchmarks like AssistantBench.

2. One major concern I have is wh

Reviewer 03 · Rating 6 · Confidence 2

Strengths

Novel and practical benchmark: DeepShop fills an important gap by testing systems on real-time, constraint-heavy shopping tasks that mirror real user behavior.

Systematic query evolution: The staged expansion from simple to complex queries is methodologically sound and produces controllable difficulty tiers.

Granular evaluation: Separating fine-grained metrics (attributes, filters, sorting) from holistic success provides diagnostic insights that go beyond aggregate scores.

Comprehensive cover

Weaknesses

Limited task realism in some cases: Although queries evolve in complexity, the benchmark still relies on synthetic LLM-generated text rather than actual user intent logs, which may limit ecological validity.

Evaluation dependency on GPT-4o: Automated judging introduces potential bias, especially since GPT-4o is also part of the evaluated systems.

Unclear generalization beyond e-commerce: The authors claim DeepShop can inspire broader web-agent research, but no experiments are shown outside sho
