Valley2: Exploring Multimodal Models with Scalable Vision-Language Design
Ziheng Wu, Zhenghao Chen, Ruipu Luo, Can Zhang, Yuan Gao, Zhentao He, Xian Wang, Haoran Lin, Minghui Qiu

TL;DR
Valley2 is a multimodal vision-language model aimed at e-commerce and short-video scenarios. It reports state-of-the-art results on e-commerce benchmarks and ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters.
Contribution
Introduces Valley2, a scalable multimodal model that achieves state-of-the-art results on e-commerce benchmarks and targets short-video scenarios, with open-source code and weights.
Findings
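State-of-the-art performance on e-commerce benchmarks (79.66)
Second place on the OpenCompass leaderboard among models with fewer than 10B parameters, with an average score of 67.4
Outperforms open-source models of similar size on e-commerce benchmarks by a large margin (79.66 vs. 72.76)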
Abstract
Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.
Taxonomy
Topics: Speech and dialogue systems
