Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

TL;DR
Mirasol3B is a multimodal autoregressive model that processes time-aligned modalities (audio and video) separately from contextual modalities (text), which lets it handle long sequences and reach state-of-the-art results on multimodal benchmarks.
Contribution
The paper proposes a decoupled multimodal modeling approach with specialized autoregressive components and a Combiner mechanism for efficient long-sequence processing.
Findings
Achieves state-of-the-art performance on multimodal benchmarks.
Effectively models long-range dependencies in media inputs.
Reduces computational demands through compact representations.
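
The compute saving from compact representations can be illustrated with a toy sketch. It is not the paper's Combiner, which is a learned module. It assumes, for illustration only, that the time-aligned features are split into fixed-length snippets and that each snippet is reduced to a few vectors by simple mean pooling. The function names, snippet length, and output size are all hypothetical.

```python
def split_into_snippets(frames, snippet_len):
    """Partition a time-ordered list of feature vectors into consecutive snippets."""
    return [frames[i:i + snippet_len] for i in range(0, len(frames), snippet_len)]


def mean_pool_combiner(snippet, num_out):
    """Hypothetical stand-in for a learned Combiner.

    Averages contiguous groups of vectors so that the snippet is
    represented by at most num_out vectors.
    """
    group = -(-len(snippet) // num_out)  # ceiling division
    pooled = []
    for start in range(0, len(snippet), group):
        chunk = snippet[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def compress_sequence(frames, snippet_len, num_out):
    """Compress every snippet independently, keeping snippets in time order."""
    return [mean_pool_combiner(s, num_out)
            for s in split_into_snippets(frames, snippet_len)]


# Toy input: 64 time steps of 4-dimensional features (assumed sizes).
frames = [[float(t)] * 4 for t in range(64)]
compact = compress_sequence(frames, snippet_len=16, num_out=2)
total_tokens = sum(len(s) for s in compact)
print(len(frames), "->", total_tokens)
```

In this sketch a downstream autoregressive model would attend over `total_tokens` vectors rather than every input frame, so its sequence length grows with the number of snippets instead of the number of raw frames.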
Abstract
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an…
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Speech Recognition and Synthesis
