ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?

Huaixiao Tou; Ying Zeng; Yuemeng Li; Cong Ma; Muzhi Li; Minghao Li; Weijie Yuan; He Zhang; Kai Jia

arXiv:2511.22978·cs.CL·February 10, 2026

TL;DR

ShoppingComp is a new benchmark that rigorously evaluates LLMs on real-world shopping tasks, exposing significant performance gaps in product retrieval, report generation, and safety decision-making.

Contribution

This paper introduces ShoppingComp, a comprehensive and challenging benchmark for assessing LLMs in realistic shopping scenarios, highlighting their current limitations.

Findings

1. State-of-the-art LLMs perform poorly on the benchmark.
2. Core capabilities like information grounding and multi-constraint verification are weak.
3. Current models lack reliable reasoning and risk-aware decision making.

Abstract

We present ShoppingComp, a challenging real-world benchmark for comprehensively evaluating LLM-powered shopping agents on three core capabilities: precise product retrieval, expert-level report generation, and safety-critical decision making. Unlike prior e-commerce benchmarks, ShoppingComp introduces difficult product discovery queries with many constraints, while guaranteeing open-world products and enabling easy verification of agent outputs. The benchmark comprises 145 instances and 558 scenarios, curated by 35 experts to reflect authentic shopping needs. Results reveal stark limitations of current LLMs: even state-of-the-art models achieve low performance (e.g., 17.76% for GPT-5.2, 15.82% for Gemini-3-Pro). Error analysis reflects limitations in core agent competencies, including information grounding in open-world environments, reliable verification of multi-constraint…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Very relevant and timely problem being considered.
- Large dataset curated and to be publicly released.
- Interesting new tasks considered, especially around safety.
- Promising results shown, and some interesting patterns uncovered.

Weaknesses

- Very poorly written at times, and too high-level.
- Hand-wavy explanations; reads like a quick summary without in-depth explanation.
- Evaluation is also high-level and hand-wavy, without examples to help understand the data and the tasks in greater depth.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Interesting and practical topic. This paper focuses on a valuable and interesting application scenario, namely, LLM-based shopping agents.
2. Empirical takeaways from the experiments. The paper conducts large-scale experiments to highlight the bottlenecks of LLM-based agents in real-world deployment.
3. Well-defined scoring dimensions. The scoring includes three parts of evaluation metrics, which provide a comprehensive foundation for future research.

Weaknesses

1. Dataset accessibility and details. The paper promises future open-sourcing but provides no current release, subset, or even a few appendix samples; this undermines claims of reproducibility for a benchmark/dataset paper.
2. Potential evaluation bias risk. Gemini-2.5-Pro is used as both the LLM evaluator and an evaluation target, raising potential bias concerns (existing research has shown that LLM evaluators tend to prefer results they generated themselves).
3. Poor organization. As a dataset

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper presents a well-designed and highly challenging benchmark. Its preliminary findings provide valuable insights into the real-world limitations of current Large Language Models (LLMs), making a timely contribution to the field of agent evaluation.

Weaknesses

1. The primary contribution leans more towards empirical findings than methodological innovation.
2. The Y-axis in Figure 4 lacks a clear label, which hinders the interpretation of the chart.
3. There is an inconsistency in the reported performance for GPT-5. The "Conclusion and Future Work" section states it "reaches only 19.6%", which appears to conflict with the data presented in Table 2. Please clarify or correct this discrepancy.

Reviewer 04 · Rating 6 · Confidence 3

Strengths

1. This research is interesting and valuable. The authors propose a benchmark for evaluating LLM-driven shopping agents in real tasks and scenarios, which is no longer limited to academic benchmarks but focuses more on users' actual needs and shopping experience. This provides a real application direction and reference for related research fields.
2. Besides evaluating the LLM agent's product retrieval ability and report generation quality, the authors have also expanded with a safety decision r

Weaknesses

1. The paper does not adequately introduce the proposed benchmark of 120 tasks and 1026 real scenarios: how the benchmark is composed and how it is organized in subsequent testing are not clearly explained.
2. The description of the data collection process in Section 3.2 is insufficient. Readers cannot understand the specific format and presentation of the data. Examples should be provided, for instance in the form of figures or tables. These can be provided in supplementary m

Reviewer 05 · Rating 2 · Confidence 4

Strengths

A very well-written paper.

Weaknesses

The topic of this paper is in my research area. I have the feeling that the paper does not meet the quality bar for an ICLR paper.

Code & Models

Datasets

huaixiao/ShoppingComp
dataset· 37 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Semantic Web and Ontologies