Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni; Isaac Noble; Dahun Kim; Michael S. Ryoo; Victor Gomes; Anelia Angelova

arXiv:2311.05698 · cs.CV · April 5, 2024 · 1 citation

Open Access

TL;DR

Mirasol3B is a multimodal autoregressive model that processes time-aligned modalities (audio and video) separately from contextual modalities (text). This decoupling lets it handle long sequences, and the paper reports state-of-the-art results on multimodal benchmarks.

Contribution

The paper proposes a decoupled multimodal modeling approach with specialized autoregressive components and a Combiner mechanism for efficient long-sequence processing.

Findings

01. Achieves state-of-the-art performance on multimodal benchmarks.
02. Effectively models long-range dependencies in media inputs.
03. Reduces computational demands through compact representations.

Abstract

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an…

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech Recognition and Synthesis