EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-commerce Models

Xinyi Ling; Hanwen Du; Zhihui Zhu; Xia Ning

arXiv:2508.15721 · cs.CL · November 14, 2025

TL;DR

This paper introduces EcomMMMU, a large-scale multimodal e-commerce dataset, revealing that product images can sometimes hinder performance, and proposes SUMEI, a method to strategically utilize images for improved task outcomes.

Contribution

The paper presents EcomMMMU, a comprehensive dataset for multimodal e-commerce understanding, and introduces SUMEI, a novel approach to optimize the use of visual content in large language models.

Findings

01. Product images do not always improve performance.

02. Images can sometimes degrade model accuracy.

03. SUMEI effectively leverages multiple images for better results.
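To make the first two findings concrete, here is a minimal sketch of how one might tally, sample by sample, whether adding product images helped or hurt a model. It is not the paper's evaluation code: the function name `tally_image_effect`, its inputs, and the toy correctness flags are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Sequence


def tally_image_effect(
    correct_text_only: Sequence[bool],
    correct_with_images: Sequence[bool],
) -> Counter:
    """Count samples where adding images helped, hurt, or made no difference.

    Hypothetical helper: each input holds one correctness flag per sample,
    from a text-only run and a text-plus-images run of the same model.
    """
    if len(correct_text_only) != len(correct_with_images):
        raise ValueError("both runs must cover the same samples")
    effect = Counter(helped=0, hurt=0, unchanged=0)
    for text_ok, image_ok in zip(correct_text_only, correct_with_images):
        if image_ok and not text_ok:
            effect["helped"] += 1      # wrong without images, right with them
        elif text_ok and not image_ok:
            effect["hurt"] += 1        # right without images, wrong with them
        else:
            effect["unchanged"] += 1   # same outcome either way
    return effect


# Toy, made-up correctness flags for five samples (not results from the paper).
text_only = [True, False, True, True, False]
with_images = [True, True, False, True, False]
print(tally_image_effect(text_only, with_images))
```

A nonzero "hurt" count on real predictions would correspond to the cases the findings describe, where images degrade accuracy rather than improve it.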

Abstract

E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU comprises multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis of EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates…

Code & Models

Datasets

NingLab/EcomMMMU
dataset · 92 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis