i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

TL;DR
i-Code V2 is an autoregressive model that generates natural language from any combination of vision, language, and speech inputs, integrating the three modalities into a single generative framework.
Contribution
It is the first model to generate language from any combination of vision, language, and speech data, using a novel modality-fusing encoder and end-to-end pretraining across multiple modalities.
Findings
Outperforms state-of-the-art on 7 multimodal tasks
Effectively integrates multiple modalities into a shared representation
Demonstrates strong generalization across diverse multimodal signals
Abstract
The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder in order to flexibly project combinations of modalities into a shared representational space. Next, language tokens are generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that can be generalized…
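The data flow described in the abstract (single-modality encoders, a fusion step into a shared space, then autoregressive decoding) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`toy_encoder`, `fuse`, `next_token`, `generate`, `DIM`, `VOCAB`) is hypothetical, the encoders are simple scaling functions standing in for pretrained networks, fusion is assumed to be plain sequence concatenation, and the decoder is a hand-written scoring rule rather than a trained transformer. It only shows how any subset of modalities could feed one generation loop.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]
DIM = 4                           # toy width of the shared space (assumed)
VOCAB = ["<eos>", "a", "b", "c"]  # toy vocabulary; index 0 ends generation


def toy_encoder(seed: int) -> Callable[[List[float]], List[Vector]]:
    """Stand-in for a pretrained single-modality encoder.

    Maps each scalar input frame to one DIM-wide vector.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

    def encode(frames: List[float]) -> List[Vector]:
        return [[w * x for w in weights] for x in frames]

    return encode


ENCODERS = {
    "vision": toy_encoder(0),
    "language": toy_encoder(1),
    "speech": toy_encoder(2),
}


def fuse(inputs: Dict[str, List[float]]) -> List[Vector]:
    """Encode whichever modalities are present and join them into one sequence."""
    fused: List[Vector] = []
    for name in ("vision", "language", "speech"):
        if name in inputs:
            fused.extend(ENCODERS[name](inputs[name]))
    if not fused:
        raise ValueError("at least one modality is required")
    return fused


def next_token(memory: List[Vector], prefix: List[int]) -> int:
    """Toy decoder step: pick the best-scoring token given memory and prefix."""
    pooled = [sum(col) / len(memory) for col in zip(*memory)]
    # Penalise non-<eos> tokens as the prefix grows so generation tends to stop.
    scores = [
        pooled[i] - (0.5 * len(prefix) if i != 0 else 0.0)
        for i in range(len(VOCAB))
    ]
    return max(range(len(VOCAB)), key=scores.__getitem__)


def generate(inputs: Dict[str, List[float]], max_len: int = 8) -> List[str]:
    """Greedy autoregressive loop: each step conditions on the tokens so far."""
    memory = fuse(inputs)
    tokens: List[int] = []
    for _ in range(max_len):
        tok = next_token(memory, tokens)
        if tok == 0:  # <eos>
            break
        tokens.append(tok)
    return [VOCAB[t] for t in tokens]
```

Calling `generate({"vision": [1.0, 2.0], "speech": [0.5]})` and `generate({"language": [0.3]})` both go through the same loop, which is the property the abstract emphasises: the decoder sees one fused sequence regardless of which modalities were supplied.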
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems
