Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan

TL;DR
This survey reviews how Next Token Prediction (NTP) is expanding from language models to multimodal tasks, proposing a taxonomy to unify understanding and generation across modalities.
Contribution
It introduces a comprehensive taxonomy for multimodal NTP, covering tokenization, model architectures, task representation, datasets, and challenges, aiding future research.
Findings
NTP is effective across various modalities.
A unified taxonomy for multimodal NTP is proposed.
Open challenges in multimodal NTP are identified.
Abstract
Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework by transforming multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets & evaluation, and open…
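To make the NTP objective described in the abstract concrete, here is a minimal sketch of the average next-token negative log-likelihood over a single token sequence. It is an illustration only, not the survey's method: the function `ntp_loss`, the callable `next_token_probs`, the toy `uniform_model`, and the shared vocabulary layout (ids 0-3 as text tokens, ids 4-7 as discrete image codes) are all hypothetical names and assumptions introduced for this example.

```python
import math

def ntp_loss(token_ids, next_token_probs):
    """Average negative log-likelihood of each token given its prefix.

    next_token_probs is a hypothetical callable: it takes the prefix
    (a list of token ids) and returns a probability distribution
    (a list of floats) over the whole vocabulary.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = next_token_probs(token_ids[:t])   # condition on the context
        total += -math.log(probs[token_ids[t]])   # penalize the true next token
    return total / (len(token_ids) - 1)

# Assumed toy vocabulary shared across modalities:
# ids 0-3 stand for text tokens, ids 4-7 for discrete image codes.
VOCAB_SIZE = 8

def uniform_model(prefix):
    # Stand-in for a trained model: ignores the prefix entirely.
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

sequence = [0, 2, 5, 7, 4, 1]  # interleaved text and image tokens
loss = ntp_loss(sequence, uniform_model)
print(loss)
```

Because the stand-in model is uniform, the loss equals the log of the vocabulary size. A trained model would lower this value by assigning more probability to the tokens that actually follow each context, whichever modality they come from.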
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
