A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

TL;DR
LLoVi is a simple, effective framework for long-range video question-answering that leverages visual captioners and large language models, achieving state-of-the-art results without complex modeling techniques.
Contribution
The paper introduces LLoVi, a straightforward LVQA approach using captioning and LLMs, outperforming prior complex methods on multiple benchmarks.
Findings
Achieves 50.3% accuracy on EgoSchema, surpassing the previous best by 18.1 percentage points (absolute gain).
Outperforms the previous state of the art on the NExT-QA and IntentQA datasets.
Extends to grounded LVQA, outperforming all prior methods.
Abstract
We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose short and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform long-range temporal reasoning needed to understand the whole video and answer a question. To…
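The two-stage decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sample_clips`, `build_prompt`, and `answer_question` are hypothetical, the `captioner` and `llm` arguments are placeholder callables standing in for a real visual captioner (e.g., LaViLa) and a real LLM (e.g., GPT-3.5), and the prompt wording is invented for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Clip = Tuple[float, float]  # (start_s, end_s) of one short clip


def sample_clips(duration_s: float, clip_len_s: float = 1.0) -> List[Clip]:
    """Densely cover a long video with consecutive, non-overlapping short clips."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    n_clips = math.ceil(duration_s / clip_len_s)
    return [
        (i * clip_len_s, min((i + 1) * clip_len_s, duration_s))
        for i in range(n_clips)
    ]


def build_prompt(captions: Sequence[str], question: str,
                 options: Sequence[str]) -> str:
    """Join time-ordered captions with the question (prompt wording is assumed)."""
    caption_block = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    option_block = "\n".join(
        f"{chr(ord('A') + i)}. {o}" for i, o in enumerate(options)
    )
    return (
        "Captions of consecutive short clips from one long video:\n"
        f"{caption_block}\n\n"
        f"Question: {question}\n"
        f"Options:\n{option_block}\n"
        "Answer with the letter of the best option."
    )


def answer_question(duration_s: float, question: str, options: Sequence[str],
                    captioner: Callable[[float, float], str],
                    llm: Callable[[str], str],
                    clip_len_s: float = 1.0) -> str:
    """Two-stage sketch: short-term captioning, then long-range LLM reasoning."""
    # Stage 1: caption each short clip independently (hypothetical captioner).
    captions = [captioner(start, end)
                for start, end in sample_clips(duration_s, clip_len_s)]
    # Stage 2: the LLM reads all captions at once and answers the question.
    return llm(build_prompt(captions, question, options)).strip()
```

Passing the captioner and LLM in as plain callables keeps the sketch independent of any particular model or API; a real captioner would take the decoded frames of each clip rather than only its timestamps.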
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
