Valley3: Scaling Omni Foundation Models for E-commerce

Zeyu Chen; Guanghao Zhou; Qixiang Yin; Ziwang Zhao; Huanjin Yao; Pengjiu Xia; Min Yang; Cen Chen; Minghui Qiu

arXiv:2605.01278 · cs.AI · May 7, 2026

TL;DR

Valley3 is an omni multimodal large language model for diverse e-commerce tasks. It combines native multilingual audio support, cross-modal understanding, and reasoning, and offers enhanced search and reasoning modes.

Contribution

The paper introduces Valley3, an omni multimodal LLM for e-commerce with native multilingual audio support, a four-stage continued pre-training pipeline, and advanced reasoning and search functionality.

Findings

01. Valley3 outperforms strong baselines on proprietary and open-source e-commerce benchmarks.

02. The model demonstrates effective long-context reasoning and multi-modal understanding.

03. Post-training improves Valley3's reasoning depth and efficiency.

Abstract

In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning…
