UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

Mustafa Shukor; Corentin Dancette; Alexandre Rame; Matthieu Cord

arXiv:2307.16184 · cs.CV · December 25, 2023 · 2 cites

TL;DR

UnIVAL is a unified multimodal model with roughly 0.25B parameters that supports text, images, video, and audio, and performs competitively across tasks without relying on massive datasets or billions of parameters.

Contribution

The paper introduces UnIVAL, a compact unified model that supports multiple modalities and tasks, and explores model merging techniques for improved generalization.

Findings

01

UnIVAL supports four modalities with competitive performance.

02

Model merging via weight interpolation benefits out-of-distribution generalization.

03

Efficient training on diverse tasks enables broad multimodal capabilities.
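
Finding 02 refers to merging models by interpolating their weights. The listing does not describe the paper's exact merging recipe, so the following is only a minimal sketch of generic linear weight interpolation between two models that share the same architecture. The function name `interpolate_weights`, the coefficient `lam`, and the toy parameter dictionaries are hypothetical and chosen for illustration; plain Python lists stand in for real tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(a: Weights, b: Weights, lam: float) -> Weights:
    """Return lam * a + (1 - lam) * b, computed parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if a.keys() != b.keys():
        raise ValueError("both models must share the same parameter names")
    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            lam * x + (1.0 - lam) * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Toy example (assumed values): two fine-tuned copies of the same tiny model.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = interpolate_weights(model_a, model_b, lam=0.5)
print(merged)
```

With `lam=0.5` the result is the element-wise average of the two models; `lam=1.0` recovers `model_a` and `lam=0.0` recovers `model_b`. In practice the same loop would run over a framework's parameter dictionary rather than Python lists.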

Abstract

Large Language Models (LLMs) have brought the ambitious quest for generalist agents much closer to reality. A key hurdle in building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, which allows a myriad of tasks and modalities to be supported within one unified framework. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy dataset sizes or models with billions of parameters, the ~0.25B-parameter UnIVAL model…

Code & Models

Repositories

mshukor/unival
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques