Learning Video Context as Interleaved Multimodal Sequences

Kevin Qinghong Lin; Pengchuan Zhang; Difei Gao; Xide Xia; Joya Chen; Ziteng Gao; Jinheng Xie; Xuhong Xiao; Mike Zheng Shou

arXiv:2407.21757·cs.CV·September 13, 2024

TL;DR

MovieSeq is a multimodal language model that represents narrative videos as interleaved multimodal sequences (images, plots, videos, and subtitles), so that a single model can handle a range of video understanding tasks.

Contribution

The paper introduces MovieSeq, which models videos as interleaved multimodal sequences and uses instruction-tuning so that the language model can interact with videos through interleaved multimodal instructions.

Findings

01 Effective across six datasets and five tasks

02 Improves performance in video classification and question-answering

03 Demonstrates the benefit of multimodal interleaved sequences

Abstract

Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying who, relationships, and reasons). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or by using offline models (such as Whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these…
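
To make the idea of an interleaved multimodal sequence concrete, the following is a minimal sketch of one way such an input could be assembled: character photos are paired with names, followed by dialogue lines. It is an illustration only, not the MovieSeq implementation. The class names, the helper functions, the file paths, and the `<image>` placeholder token are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    path: str  # hypothetical path to a character photo or video frame


Segment = Union[TextSegment, ImageSegment]


def build_character_context(
    characters: List[Tuple[str, str]],
    dialogues: List[Tuple[str, str]],
) -> List[Segment]:
    """Interleave each character photo with its name, then append dialogues."""
    seq: List[Segment] = []
    for name, photo in characters:
        seq.append(ImageSegment(photo))
        seq.append(TextSegment(f"This is {name}."))
    for speaker, line in dialogues:
        seq.append(TextSegment(f"{speaker}: {line}"))
    return seq


def render_prompt(
    seq: List[Segment], image_token: str = "<image>"
) -> Tuple[str, List[str]]:
    """Flatten the sequence into a text prompt with one placeholder per image.

    Returns the prompt and the image paths in the order their placeholders
    appear. The placeholder string is an assumption for this sketch.
    """
    parts: List[str] = []
    images: List[str] = []
    for seg in seq:
        if isinstance(seg, ImageSegment):
            parts.append(image_token)
            images.append(seg.path)
        else:
            parts.append(seg.text)
    return " ".join(parts), images


if __name__ == "__main__":
    seq = build_character_context(
        characters=[("Alice", "alice.jpg")],
        dialogues=[("Alice", "Hello.")],
    )
    prompt, images = render_prompt(seq)
    print(prompt)
    print(images)
```

Keeping the image paths in a separate list, ordered by placeholder position, is one common way to hand images to a vision encoder while the text goes to the tokenizer; whether MovieSeq does this is not stated in the abstract above.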

Code & Models

Repositories

showlab/movieseq
PyTorch · Official

Taxonomy

Topics: Linguistic Education and Pedagogy · Digital Storytelling and Education