Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models
Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, Haodong Wang, Zhengyang Wang, Wenju Xu, Jingfeng Yang, Qingyu Yin, Xian Li, Priyanka Nigam, Yi Xu, Kai Chen, Qiang Yang, Meng Jiang, Bing Yin

TL;DR
Shopping MMLU is a comprehensive benchmark derived from real-world Amazon data, designed to evaluate large language models' multi-task online shopping capabilities across diverse skills and languages.
Contribution
The paper introduces a new multi-task benchmark of 57 tasks, built from real-world data, for evaluating LLMs on diverse online shopping skills, and hosts a related competition.
Findings
Benchmarking reveals strengths and weaknesses of existing LLMs in online shopping tasks.
The benchmark facilitates the development of more versatile and effective shopping assistant models.
Insights from the competition guide future research in LLM-based e-commerce applications.
Abstract
Online shopping is a complex multi-task, few-shot learning problem with a wide and evolving range of entities, relations, and tasks. However, existing models and benchmarks are commonly tailored to specific tasks, falling short of capturing the full complexity of online shopping. Large Language Models (LLMs), with their multi-task and few-shot learning abilities, have the potential to profoundly transform online shopping by alleviating task-specific engineering efforts and by providing users with interactive conversations. Despite the potential, LLMs face unique challenges in online shopping, such as domain-specific concepts, implicit knowledge, and heterogeneous user behaviors. Motivated by the potential and challenges, we propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data. Shopping MMLU consists of 57 tasks covering 4 major…
Taxonomy
Topics: Recommender Systems and Techniques
