Kelix Technical Report

Boyang Ding; Chenglong Chu; Dunju Zang; Han Li; Jiangxia Cao; Kun Gai; Muhao Wei; Ruiming Tang; Shiyao Wang; Siyang Mao; Xinchen Luo; Yahui Liu; Zhixin Ling; Zhuoran Yang; Ziming Li; Chengru Song; Guorui Zhou; Guowang Zhang; Hao Peng; Hao Wang; Jiaxin Deng; Jin Ouyang; Jinghao Zhang; Lejian Ren; Qianqian Wang; Qigen Hu; Tao Wang; Xingmei Wang; Yiping Yang; Zixing Zhang; Ziqi Wang

arXiv:2602.09843 · cs.CV · February 13, 2026

TL;DR

Kelix is a fully discrete autoregressive model that enhances multimodal understanding by bridging the gap between discrete visual tokens and continuous representations, enabling more effective unified language and vision processing.

Contribution

Kelix introduces a novel fully discrete autoregressive model that improves multimodal understanding, addressing limitations of previous discrete visual tokenization methods.

Findings

01 Kelix achieves understanding performance comparable to that of continuous-feature VLMs.

02 Discrete visual tokens in Kelix retain more information than those of previous methods.

03 Kelix demonstrates improved performance in multimodal tasks.

Abstract

Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling