Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu; Jaemin Cho; Prateek Yadav; Mohit Bansal

arXiv:2305.06988·cs.CV·December 1, 2023·25 cites

TL;DR

SeViLA is a novel framework that leverages a single image-language model to efficiently localize keyframes and answer questions in videos, reducing annotation costs and improving performance.

Contribution

The paper introduces SeViLA, a parameter-efficient, self-refining approach that uses a single pre-trained image-language model for both temporal localization and question answering, and outperforms strong baselines on 5 video QA benchmarks.

Findings

01 Outperforms strong baselines on 5 video QA benchmarks.

02 Achieves state-of-the-art results in fine-tuning and zero-shot settings.

03 Effective self-refinement reduces the need for expensive annotations.

Abstract

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both…
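The contrast the abstract draws between uniform frame sampling and query-aware keyframe selection can be illustrated with a minimal sketch. This is not the SeViLA implementation: the per-frame relevance scores are assumed to come from some localizer (in the paper, a BLIP-2-based one), and the function names `uniform_sample` and `select_keyframes` are hypothetical.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices evenly spaced across the video, ignoring the query."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the center of each of the k equal-width segments.
    return [int(step * i + step / 2) for i in range(k)]


def select_keyframes(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring frame indices, returned in temporal order.

    `scores` is a hypothetical per-frame relevance score for the question;
    how it is computed is outside the scope of this sketch.
    """
    if k <= 0:
        return []
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy example: only frames 4 and 5 are relevant to the question.
toy_scores = [0.1, 0.0, 0.2, 0.1, 0.9, 0.8, 0.1, 0.0]
uniform_frames = uniform_sample(len(toy_scores), 2)
query_aware_frames = select_keyframes(toy_scores, 2)
```

In this toy case the uniformly sampled indices miss both relevant frames, while the score-based selection returns exactly those two, which is the failure mode of uniform sampling that the abstract describes.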

Code & Models

Repositories

yui010206/sevila (PyTorch, official)

Videos

Self-Chained Image-Language Model for Video Localization and Question Answering · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques