Towards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications

Shuyi Xie; Ziqin Liew; Hailing Zhang; Haibo Zhang; Ling Hu; Zhiqiang Zhou; Shuman Liu; Anxiang Zeng

arXiv:2510.20632·cs.AI·October 24, 2025

TL;DR

This paper introduces EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating large language models in e-commerce, addressing limitations of existing benchmarks by including diverse, real-world tasks across multiple languages and modalities.

Contribution

The paper presents EcomEval, a new benchmark with 37 tasks across six categories, incorporating authentic data, expert-reviewed answers, and multilingual, multimodal evaluation for e-commerce LLMs.

Findings

01. EcomEval covers 7 languages, including low-resource ones.

02. The benchmark includes 8 multimodal tasks reflecting real-world scenarios.

03. Evaluation scores enable challenge-oriented assessment across model sizes.

Abstract

Large Language Models (LLMs) excel on general-purpose NLP benchmarks, yet their capabilities in specialized domains remain underexplored. In e-commerce, existing evaluations (such as EcomInstruct, ChineseEcomQA, eCeLLM, and Shopping MMLU) suffer from limited task diversity (e.g., lacking product guidance and after-sales issues), limited task modalities (e.g., absence of multimodal data), synthetic or curated data, and a narrow focus on English and Chinese. This leaves practitioners without reliable tools to assess models on complex, real-world shopping scenarios. We introduce EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating LLMs in e-commerce. EcomEval covers six categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs, reflecting the noisy and heterogeneous nature of real business…

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text Readability and Simplification