DrVideo: Document Retrieval Based Long Video Understanding

Ziyu Ma; Chenhui Gou; Hengcan Shi; Bin Sun; Shutao Li; Hamid Rezatofighi; Jianfei Cai

arXiv:2406.12846 · cs.CV · November 26, 2024

TL;DR

DrVideo introduces a novel approach for understanding long videos by converting them into text documents and iteratively retrieving key information, enabling large language models to perform long-range reasoning effectively.

Contribution

The paper presents DrVideo, a system that transforms long videos into text documents and uses an agent-based iterative process for improved long video understanding with LLMs.

Findings

01 Outperforms state-of-the-art LLM-based methods on multiple long video benchmarks.

02 Effectively retrieves and reasons over long video content.

03 Demonstrates significant accuracy improvements on benchmarks.

Abstract

Most of the existing methods for video understanding primarily focus on videos lasting only tens of seconds, with limited exploration of techniques for handling long videos. The increased number of frames in long videos poses two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a coarse text-based long document to initially retrieve key frames and then updates the documents with the augmented key frame information. It then employs an agent-based iterative loop to continuously search for missing information and augment the…
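The pipeline described in the abstract (video to document, initial key-frame retrieval, augmentation, then an iterative agent loop) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the bag-of-words similarity used in place of a real text embedder, and the `describe_frame` and `agent_step` callbacks (stand-ins for a vision-language model and an LLM agent) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt
from typing import Callable

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (assumed stand-in for an embedder)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(document: dict[int, str], question: str, k: int) -> list[int]:
    """Return the indices of the k frame entries most similar to the question."""
    ranked = sorted(document, key=lambda i: bow_cosine(document[i], question), reverse=True)
    return sorted(ranked[:k])

def document_retrieval_loop(
    captions: list[str],
    question: str,
    describe_frame: Callable[[int, str], str],   # hypothetical VLM call
    agent_step: Callable[[dict[int, str], str], tuple[bool, list[int]]],  # hypothetical LLM agent
    k: int = 2,
    max_iters: int = 3,
) -> dict[int, str]:
    # 1. Coarse document: one short caption per sampled frame.
    document = dict(enumerate(captions))
    # 2. Initial retrieval of key frames, then augment them with richer descriptions.
    for idx in retrieve_top_k(document, question, k):
        document[idx] += " " + describe_frame(idx, question)
    # 3. Agent loop: ask whether information is sufficient; if not, augment the
    #    frames the agent flags as missing and try again.
    for _ in range(max_iters):
        sufficient, missing = agent_step(document, question)
        if sufficient:
            break
        for idx in missing:
            document[idx] += " " + describe_frame(idx, question)
    return document
```

In this sketch the loop stops either when the agent reports that the document holds enough information or after `max_iters` rounds; the cap is an assumption added here to guarantee termination.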

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Generative Adversarial Networks and Image Synthesis
