MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

Junxian Wu; Chenghan Fu; Zhanheng Nie; Daoze Zhang; Bowen Wan; Wanxian Guan; Chuan Yu; Jian Xu; Bo Zheng

arXiv:2604.00513 · cs.LG · April 3, 2026

TL;DR

MOON3.0 is a reasoning-aware multimodal representation model for e-commerce product understanding. It uses the reasoning capabilities of MLLMs to model fine-grained product attributes explicitly, rather than encoding product information only implicitly in a global embedding.

Contribution

The paper introduces MOON3.0, presented as the first reasoning-aware MLLM-based model for product representation learning, with modules for adaptive fusion, autonomous reasoning, and detail preservation.

Findings

01  Achieves state-of-the-art zero-shot performance on multiple datasets.

02  Effectively models fine-grained product attributes.

03  Outperforms existing models in product understanding tasks.

Abstract

With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective…
