i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Ziyi Yang; Mahmoud Khademi; Yichong Xu; Reid Pryzant; Yuwei Fang; Chenguang Zhu; Dongdong Chen; Yao Qian; Mei Gao; Yi-Ling Chen; Robert Gmyr; Naoyuki Kanda; Noel Codella; Bin Xiao; Yu Shi; Lu Yuan; Takuya Yoshioka; Michael Zeng; Xuedong Huang

arXiv:2305.12311·cs.CL·May 23, 2023·1 cites

Open Access

TL;DR

i-Code V2 is an autoregressive model that generates natural language from any combination of vision, language, and speech inputs, integrating the three modalities in a single generative framework.

Contribution

i-Code V2 is described as the first model to generate language from any combination of vision, language, and speech data, using a new modality-fusing encoder and end-to-end pretraining on dual- and single-modality datasets.

Findings

01. Matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks

02. Projects combinations of modalities into a shared representational space

03. Generalizes across a diversity of multimodal tasks and signals

Abstract

The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder in order to flexibly project combinations of modalities into a shared representational space. Next, language tokens are generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that can be generalized…
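
The data flow described in the abstract (single-modality encoders, a fusion encoder over their combined outputs, then an autoregressive decoder) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the `DIM` width, the `VOCAB` list, and the arithmetic standing in for trained networks are hypothetical, chosen only to show how any subset of modalities could feed one generation loop.

```python
from typing import List, Optional

Vector = List[float]
DIM = 4                              # hypothetical shared feature width
VOCAB = ["<eos>", "a", "b", "c"]     # hypothetical toy vocabulary


def encode_text(text: str) -> List[Vector]:
    """Stand-in language encoder: one vector per whitespace token."""
    return [[float(len(tok))] * DIM for tok in text.split()]


def encode_vision(pixels: List[float]) -> List[Vector]:
    """Stand-in vision encoder: one vector per input value."""
    return [[p] * DIM for p in pixels]


def encode_speech(samples: List[float]) -> List[Vector]:
    """Stand-in speech encoder: one vector per input sample."""
    return [[s] * DIM for s in samples]


def fuse(sequences: List[List[Vector]]) -> List[Vector]:
    """Stand-in fusion encoder: concatenate the per-modality sequences,
    then mix every vector with the mean over all modalities."""
    joined = [v for seq in sequences for v in seq]
    if not joined:
        raise ValueError("at least one non-empty modality is required")
    mean = [sum(v[i] for v in joined) / len(joined) for i in range(DIM)]
    return [[0.5 * (x + m) for x, m in zip(v, mean)] for v in joined]


def decode(fused: List[Vector], max_len: int = 5) -> List[str]:
    """Greedy autoregressive loop: each step depends on the fused states
    and on the tokens emitted so far (toy scoring, not a trained model)."""
    out: List[str] = []
    context = sum(sum(v) for v in fused)
    for _ in range(max_len):
        prev = sum(VOCAB.index(t) for t in out)
        idx = int(abs(context) * 10 + prev + len(out) + 1) % len(VOCAB)
        if VOCAB[idx] == "<eos>":
            break
        out.append(VOCAB[idx])
    return out


def generate(vision: Optional[List[float]] = None,
             language: Optional[str] = None,
             speech: Optional[List[float]] = None,
             max_len: int = 5) -> List[str]:
    """Encode whichever modalities are present, fuse them, then decode."""
    sequences = []
    if vision is not None:
        sequences.append(encode_vision(vision))
    if language is not None:
        sequences.append(encode_text(language))
    if speech is not None:
        sequences.append(encode_speech(speech))
    return decode(fuse(sequences), max_len=max_len)
```

Calling `generate(language="a cat sat")` or `generate(vision=[0.5], speech=[0.1, 0.2])` runs the same path with different inputs. The point of the sketch is only the interface: missing modalities are skipped rather than padded, and the decoder sees a single fused sequence regardless of which inputs were supplied.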

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems