UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord

TL;DR
UnIVAL is a 0.25B-parameter unified multimodal model that supports text, images, video, and audio. It performs competitively across tasks without relying on massive datasets or billions of parameters.
Contribution
The paper introduces UnIVAL, a compact unified model that supports multiple modalities and tasks, and explores model merging techniques for improved generalization.
Findings
UnIVAL supports four modalities (text, images, video, and audio) with competitive performance.
Model merging via weight interpolation benefits out-of-distribution generalization.
Efficient training on diverse tasks enables broad multimodal capabilities.
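The weight-interpolation finding above can be illustrated with a minimal sketch. It assumes the simplest form of merging: a linear blend of two checkpoints that share the same architecture and parameter names. The function name `interpolate_weights`, the coefficient `lam`, and the toy checkpoints are hypothetical and are not taken from the paper or its code release.

```python
from typing import Dict, List

# A checkpoint is modelled here as a mapping from parameter name to a flat
# list of floats. Real checkpoints hold tensors; this is a toy stand-in.
Weights = Dict[str, List[float]]


def interpolate_weights(theta_a: Weights, theta_b: Weights, lam: float) -> Weights:
    """Return (1 - lam) * theta_a + lam * theta_b, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("both checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, a in theta_a.items():
        b = theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two sets of weights.
        merged[name] = [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]
    return merged


# Hypothetical checkpoints, e.g. one model fine-tuned per task.
model_task_a: Weights = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0, 4.0]}
model_task_b: Weights = {"encoder.w": [3.0, 0.0], "decoder.w": [2.0, 0.0]}

merged = interpolate_weights(model_task_a, model_task_b, lam=0.5)
```

With `lam=0.0` the result equals the first checkpoint and with `lam=1.0` the second; intermediate values blend the two. Whether a given blend helps out-of-distribution generalization is an empirical question that has to be checked on held-out tasks.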
Abstract
Large Language Models (LLMs) have made the ambitious quest for generalist agents far from a fantasy. A key hurdle in building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, which allows a myriad of tasks and modalities to be supported within one unified framework. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy dataset sizes or models with billions of parameters, the ~0.25B parameter UnIVAL model…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
