An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

TL;DR
This paper introduces a novel zero-shot video question answering method that transforms videos into image grids, allowing the use of a single VLM without training on video data, and outperforms existing methods on most benchmarks.
Contribution
The study presents a simple, effective strategy to convert videos into image grids for zero-shot video QA, eliminating the need for video-specific training of vision language models.
Findings
Outperforms existing methods in 9 out of 10 benchmarks
Enables direct application of VLMs to videos without training
Maintains temporal information within image grids
Abstract
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques
