Ola: Pushing the Frontiers of Omni-Modal Language Model
Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

TL;DR
Ola is an open-source omni-modal language model that achieves competitive performance across image, video, and audio understanding through its architectural design, data strategies, and a progressive training pipeline.
Contribution
The paper introduces Ola, a novel omni-modal language model that significantly improves multi-modal understanding and aligns cross-modal representations more effectively than previous open-source models.
Findings
Ola surpasses existing open omni-modal models across image, video, and audio understanding.
Ola achieves performance comparable to specialized models of similar size.
The proposed training strategy enhances cross-modal alignment and understanding.
Abstract
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, pushing the frontiers of the omni-modal language model to a large extent. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
