Kwai Keye-VL Technical Report

Kwai Keye Team; Biao Yang; Bin Wen; Changyi Liu; Chenglong Chu; Chengru Song; Chongling Rao; Chuan Yi; Da Li; Dunju Zang; Fan Yang; Guorui Zhou; Hao Peng; Haojie Ding; Jiaming Huang; Jiangxia Cao; Jiankang Chen; Jingyun Hua; Jin Ouyang; Kaibing Chen; Kaiyu Jiang; Kaiyu Tang; Kun Gai; Shengnan Zhang; Siyang Mao; Sui Huang; Tianke Zhang; Tingting Gao; Wei Chen; Wei Yuan; Xiangyu Wu; Xiao Hu; Xingyu Lu; Yang Zhou; Yi-Fan Zhang; Yiping Yang; Yulong Chen; Zhenhua Wu; Zhenyu Li; Zhixin Ling; Ziming Li; Dehua Ma; Di Xu; Haixuan Gao; Hang Li; Jiawei Guo; Jing Wang; Lejian Ren; Muhao Wei; Qianqian Wang; Qigen Hu; Shiyao Wang; Tao Yu; Xinchen Luo; Yan Li; Yiming Liang; Yuhang Hu; Zeyi Lu; Zhuoran Yang; Zixing Zhang

arXiv:2507.01949·cs.CV·July 3, 2025

Open Access · 1 Repo · 3 Models · 2 Datasets

TL;DR

Kwai Keye-VL is an 8-billion-parameter multimodal model built for short-video understanding. It is trained on a video-heavy dataset of more than 600 billion tokens with a multi-stage training recipe, and is reported to outperform existing models on video benchmarks while maintaining strong general vision-language abilities.

Contribution

The paper introduces Kwai Keye-VL, a novel 8-billion-parameter multimodal model with a unique training recipe and a large video-focused dataset, achieving state-of-the-art short-video understanding.

Findings

01. Achieves state-of-the-art results on public video benchmarks.
02. Maintains competitive performance on image-based tasks.
03. Develops a new benchmark for real-world short-video scenarios.

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction…

Code & Models

Repositories

kwai-keye/keye · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning