EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-commerce Models
Xinyi Ling, Hanwen Du, Zhihui Zhu, Xia Ning

TL;DR
This paper introduces EcomMMMU, a large-scale multimodal e-commerce dataset, revealing that product images can sometimes hinder performance, and proposes SUMEI, a method to strategically utilize images for improved task outcomes.
Contribution
The paper presents EcomMMMU, a comprehensive dataset for multimodal e-commerce understanding, and introduces SUMEI, a novel approach to optimize the use of visual content in multimodal large language models (MLLMs).
Findings
Product images do not always improve performance.
Images can sometimes degrade model accuracy.
SUMEI effectively leverages multiple images for better results.
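The first two findings amount to a per-sample comparison between a text-only run and a run that also sees the product images. The sketch below is only an illustration of that comparison, not the paper's evaluation code: the function name `image_effect` and the toy correctness lists are hypothetical, and the values are made up rather than taken from EcomMMMU.

```python
from collections import Counter


def image_effect(text_only_correct, with_image_correct):
    """Count samples where adding images helped, hurt, or changed nothing.

    Both arguments are equal-length sequences of booleans, one entry per
    sample, marking whether the model answered that sample correctly.
    (Hypothetical helper for illustration; not from the paper.)
    """
    if len(text_only_correct) != len(with_image_correct):
        raise ValueError("both runs must cover the same samples")
    counts = Counter()
    for text_ok, image_ok in zip(text_only_correct, with_image_correct):
        if image_ok and not text_ok:
            counts["helped"] += 1      # wrong without images, right with them
        elif text_ok and not image_ok:
            counts["hurt"] += 1        # right without images, wrong with them
        else:
            counts["unchanged"] += 1   # same outcome either way
    return {key: counts[key] for key in ("helped", "hurt", "unchanged")}


# Toy, made-up outcomes for five samples.
text_only = [True, False, True, True, False]
with_images = [True, True, False, True, False]
print(image_effect(text_only, with_images))
```

A nonzero "hurt" count on real data would be the kind of evidence behind the claim that images can degrade accuracy, since aggregate accuracy alone can hide samples that flip from correct to incorrect.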
Abstract
E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU comprises multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
