Valley2: Exploring Multimodal Models with Scalable Vision-Language Design

Ziheng Wu; Zhenghao Chen; Ruipu Luo; Can Zhang; Yuan Gao; Zhentao He; Xian Wang; Haoran Lin; Minghui Qiu

arXiv:2501.05901·cs.CV·January 14, 2025

TL;DR

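Valley2 is a multimodal vision-language model aimed at e-commerce and short-video scenarios. It reports state-of-the-art results on e-commerce benchmarks (79.66 vs. 72.76 for open-source models of similar size) and ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters.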

Contribution

Introduces Valley2, a scalable multimodal model that achieves state-of-the-art results on e-commerce benchmarks and targets short-video scenarios, with open-source code and weights.

Findings

01

State-of-the-art performance on e-commerce benchmarks (79.66)

02

Second place on the OpenCompass leaderboard among models with fewer than 10B parameters, with an average score of 67.4

03

Outperforms open-source models of similar size on e-commerce benchmarks (79.66 vs. 72.76)

Abstract

Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.

Code & Models

Repositories

bytedance/valley (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems