An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

Wonkyun Kim; Changin Choi; Wonseok Lee; Wonjong Rhee

arXiv:2403.18406 · cs.CV · March 28, 2024 · 1 citation

TL;DR

This paper introduces a zero-shot video question answering method that converts a video into an image grid, so that a single VLM can answer questions about it without any training on video data. The method outperforms existing methods on 9 of 10 benchmarks.

Contribution

The study presents a simple, effective strategy to convert videos into image grids for zero-shot video QA, eliminating the need for video-specific training of vision language models.
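The conversion step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `sample_indices` and `make_image_grid`, the segment-centre sampling rule, and the 2x3 layout in the usage note are all assumptions chosen for the example. Frames are sampled evenly across the video and tiled in row-major order, so reading the grid left to right, top to bottom follows the order of time.

```python
import numpy as np


def sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the video.

    Assumption: the video is split into equal segments and the centre
    frame of each segment is taken. Other sampling rules are possible.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    edges = np.linspace(0, num_frames, num_samples + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    # Clamp so an index never runs past the last frame.
    return np.minimum(centers.astype(int), num_frames - 1)


def make_image_grid(frames, rows, cols):
    """Tile sampled frames into one image.

    frames: array of shape (T, H, W, C).
    Returns an array of shape (rows * H, cols * W, C) in which frames
    appear in temporal order, left to right and then top to bottom.
    """
    frames = np.asarray(frames)
    if frames.ndim != 4:
        raise ValueError("frames must have shape (T, H, W, C)")
    picked = frames[sample_indices(len(frames), rows * cols)]
    _, h, w, c = picked.shape
    # (rows*cols, H, W, C) -> (rows, cols, H, W, C) -> (rows, H, cols, W, C)
    grid = picked.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)
```

For example, `make_image_grid(frames, rows=2, cols=3)` turns a clip into a single image holding six frames, which can then be passed to a VLM together with the question as an ordinary image input.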

Findings

01 Outperforms existing methods on 9 of 10 benchmarks

02 Enables direct application of VLMs to videos without training

03 Maintains temporal information within image grids

Abstract

Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a…

Code & Models

Repositories

imagegridworth/IG-VLM (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques