A Simple LLM Framework for Long-Range Video Question-Answering

Ce Zhang; Taixi Lu; Md Mohaiminul Islam; Ziyang Wang; Shoubin Yu; Mohit Bansal; Gedas Bertasius

arXiv:2312.17235·cs.CV·October 11, 2024·1 cites

TL;DR

LLoVi is a simple, effective framework for long-range video question-answering that leverages visual captioners and large language models, achieving state-of-the-art results without complex modeling techniques.

Contribution

The paper introduces LLoVi, a straightforward LVQA approach using captioning and LLMs, outperforming prior complex methods on multiple benchmarks.

Findings

01. Achieves 50.3% accuracy on EgoSchema, surpassing the previous best by 18.1 percentage points.

02. Outperforms the state of the art on the NExT-QA and IntentQA datasets.

03. Extends to grounded LVQA, outperforming all prior methods.

Abstract

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose short and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform long-range temporal reasoning needed to understand the whole video and answer a question. To…
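The two-stage decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `split_into_clips`, `build_prompt`, `answer_question`, the prompt wording, and the single-letter reply format are all hypothetical, and `captioner` and `llm` are placeholder callables standing in for a visual captioner (e.g., LaViLa) and an LLM (e.g., GPT-3.5). The sketch assumes a multiple-choice question.

```python
import math
from typing import Callable, List, Sequence, Tuple

Clip = Tuple[float, float]  # (start_s, end_s) of one short clip


def split_into_clips(duration_s: float, clip_len_s: float) -> List[Clip]:
    """Tile the video with back-to-back clips of clip_len_s seconds."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    n_clips = math.ceil(duration_s / clip_len_s)
    return [
        (i * clip_len_s, min((i + 1) * clip_len_s, duration_s))
        for i in range(n_clips)
    ]


def build_prompt(captions: Sequence[str], question: str,
                 options: Sequence[str]) -> str:
    """Join time-ordered captions with the question and lettered options."""
    caption_block = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    option_block = "\n".join(
        f"{chr(ord('A') + i)}. {o}" for i, o in enumerate(options)
    )
    return (
        "Captions of consecutive short clips from one video:\n"
        f"{caption_block}\n\n"
        f"Question: {question}\n{option_block}\n"
        "Answer with a single letter."
    )


def answer_question(
    duration_s: float,
    question: str,
    options: Sequence[str],
    captioner: Callable[[Clip], str],  # hypothetical: clip -> caption text
    llm: Callable[[str], str],         # hypothetical: prompt -> reply text
    clip_len_s: float = 1.0,           # assumed; the abstract cites 0.5-8 s
) -> int:
    """Return the index of the option chosen by the LLM."""
    # Stage 1: short-term captioning of densely sampled clips.
    captions = [captioner(clip) for clip in split_into_clips(duration_s, clip_len_s)]
    # Stage 2: long-range reasoning over the concatenated captions.
    reply = llm(build_prompt(captions, question, options)).strip().upper()
    if not reply:
        raise ValueError("empty LLM reply")
    index = ord(reply[0]) - ord("A")
    if not 0 <= index < len(options):
        raise ValueError(f"unexpected LLM reply: {reply!r}")
    return index
```

Keeping the captioner and the LLM behind plain callables mirrors the separation of short-term and long-range modeling: either component can be swapped without touching the other, and the only interface between the two stages is text.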

Code & Models

Repositories

ceezh/llovi (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques