MIO: A Foundation Model on Multimodal Tokens
Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang

TL;DR
MIO is a versatile multimodal foundation model capable of understanding and generating speech, text, images, and videos in an integrated manner, advancing the capabilities of artificial general intelligence.
Contribution
We introduce MIO, a novel multimodal foundation model trained on diverse tokens, enabling true any-to-any modality understanding and generation with a comprehensive four-stage training process.
Findings
MIO outperforms previous dual-modal and modality-specific models in various tasks.
MIO demonstrates advanced multimodal capabilities like interleaved video-text generation.
MIO achieves competitive or superior results across multiple benchmarks.
Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO…
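As a rough illustration of what "a mixture of discrete tokens across four modalities using causal multimodal modeling" can look like in practice, the sketch below places each modality's codes in a disjoint range of one shared vocabulary, concatenates interleaved segments into a single sequence, and builds next-token prediction pairs. The vocabulary sizes, the `SIZES` table, and every function name are hypothetical and chosen for readability; the abstract does not specify MIO's tokenizers, codebook sizes, or special tokens, so this is not the authors' implementation.

```python
# Hypothetical per-modality codebook sizes (NOT taken from the paper).
SIZES = {"text": 1000, "image": 512, "video": 512, "speech": 256}

# Give each modality a disjoint ID range inside one shared vocabulary.
OFFSETS = {}
_next_free = 0
for _name, _size in SIZES.items():
    OFFSETS[_name] = _next_free
    _next_free += _size
VOCAB_SIZE = _next_free


def to_unified(modality, codes):
    """Map modality-local codes to IDs in the shared vocabulary."""
    size = SIZES[modality]
    for code in codes:
        if not 0 <= code < size:
            raise ValueError(f"code {code} out of range for {modality}")
    return [OFFSETS[modality] + code for code in codes]


def interleave(segments):
    """Concatenate (modality, codes) segments into one token sequence."""
    seq = []
    for modality, codes in segments:
        seq.extend(to_unified(modality, codes))
    return seq


def next_token_pairs(seq):
    """Causal modeling targets: the prefix seq[:i] predicts seq[i]."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


seq = interleave([
    ("text", [5, 7]),
    ("image", [0, 3]),
    ("video", [1]),
    ("speech", [2]),
])
pairs = next_token_pairs(seq)
```

Because every modality shares one ID space, a single autoregressive model with one output softmax can, in principle, both read and emit any modality; a real system would also need boundary tokens and the actual tokenizers and detokenizers.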
Peer Reviews
Decision · Submitted to ICLR 2025
- The paper explores the potential of unifying four different modalities in a DIDO manner using existing tokenizers within a single causal MLLM.
- It also enables the generation of multimodal output in interleaved sequences.

- The paper demonstrates limited novelty in comparison to existing work within the research community.
- The overall performance of the proposed model is less competitive and lacks comprehensiveness. For example:
  - In the image understanding task, the dual-modal baselines used for comparison are relatively weak, with some considered obsolete.
  - In the image generation task, the reliance on the CLIP score metric is limiting, as it primarily focuses on text alignment and overlooks important as

1. This paper systematically analyzes the limitations of current MLLM models and focuses on their critical aspect: any-to-any understanding and generation capabilities.
2. This paper examines images, videos, and speech to explore how to create a generalized model for all modalities while noting differences among them, such as variations in token length distribution.
3. This paper compares MIO with various existing MLLM models across different tasks to showcase its effectiveness.

1. This paper uses speech-enhanced pretraining to address variations in token length distribution. However, this approach is more of a patch than a proper solution.
2. This paper states that "GPT-4o has showcased... However, it is closed-source." Therefore, will MIO be made publicly available? We do not require an additional closed-source copy of GPT-4o for research purposes.
3. This paper aims to address any-to-any understanding and generation. It should possess capabilities beyond existing metho

1. Similar to GPT-4o, the proposed "MIO" model is capable of understanding and generating multimodal content across text, image, speech, and video modalities, which is a notable advancement.
2. To address the disparity between modalities, the paper proposes a three-stage pretraining process, including alignment, interleaving, and speech ability enhancement. Experiments show that with such a design, a single unified model can generate content in various modalities.
3. MIO exhibits advan

1. This paper relies heavily on each modality's existing tokenizer or detokenizer, so it is restricted by their drawbacks. For example, SpeechTokenizer, although it provides a good semantic and acoustic discrete representation of a waveform, is RVQ-based and has many codebooks, so speech generation is inefficient (at 200 Hz).
2. As the paper claims multi-modal tokens for LLM, although it focuses on video/speech/image understanding and generation, it is necessary to evaluate the text-related be
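
The efficiency concern about RVQ-based speech tokens comes down to simple arithmetic: if the codes from every codebook are flattened into one sequence, the token rate is the frame rate multiplied by the number of codebooks. The sketch below works this through. The 50 Hz frame rate and 4 codebooks are illustrative assumptions that happen to multiply to the 200 Hz figure quoted in the review; the excerpt itself does not state how that number is composed.

```python
def tokens_per_second(frame_rate_hz, num_codebooks):
    """Token rate when all RVQ codebooks are flattened into one stream."""
    return frame_rate_hz * num_codebooks


def sequence_length(duration_s, frame_rate_hz, num_codebooks):
    """Number of tokens an autoregressive model must emit for a clip."""
    return round(duration_s * tokens_per_second(frame_rate_hz, num_codebooks))


# Illustrative values only (assumptions, not taken from the source).
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4

rate = tokens_per_second(FRAME_RATE_HZ, NUM_CODEBOOKS)
ten_second_clip = sequence_length(10, FRAME_RATE_HZ, NUM_CODEBOOKS)
print(rate, ten_second_clip)
```

Under these assumed values, ten seconds of speech costs 2,000 decoding steps, which is why a tokenizer with fewer codebooks or a lower frame rate would shorten generation proportionally.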
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation
