Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

Sullam Jeoung; Goeric Huybrechts; Bhavana Ganesh; Aram Galstyan; Sravan Bodapati

arXiv:2410.20252·cs.CV·October 29, 2024

Open Access

TL;DR

This paper introduces an adaptive video understanding agent that uses query-driven frame sampling and feedback-based reasoning with large language models to improve efficiency and accuracy in analyzing long videos.

Contribution

It presents a novel agent-based approach combining query-adaptive frame sampling and self-reflective feedback to enhance long-form video understanding with LLMs.

Findings

01

Achieves state-of-the-art performance on video benchmarks.

02

Reduces the number of frames sampled, improving efficiency.

03

Demonstrates effective reasoning and relevance filtering in video analysis.

Abstract

Understanding long-form video content presents significant challenges due to its temporal complexity and the substantial computational resources required. In this work, we propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding by utilizing large language models (LLMs) and their tool-harnessing ability. A key aspect of our method is query-adaptive frame sampling, which leverages the reasoning capabilities of LLMs to process only the most relevant frames in real-time, and addresses an important limitation of existing methods which typically involve sampling redundant or irrelevant frames. To enhance the reasoning abilities of our video-understanding agent, we leverage the self-reflective capabilities of LLMs to provide verbal reinforcement to the agent, which leads to improved performance while minimizing the number of frames…
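
The query-adaptive sampling idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the function `adaptive_sample`, its parameters, and the `score_frame` callback are hypothetical names chosen for illustration, and `score_frame` merely stands in for whatever relevance judgment an LLM-based agent would make about a frame given the query. The sketch starts from a coarse uniform pass and then spends the remaining frame budget near the most relevant frame seen so far.

```python
from typing import Callable, Dict, List


def adaptive_sample(
    num_frames: int,
    score_frame: Callable[[int], float],
    initial: int = 4,
    budget: int = 12,
    threshold: float = 0.9,
) -> List[int]:
    """Return the sorted indices of the frames inspected, at most `budget`."""
    # Coarse first pass: a few evenly spaced frames across the video.
    step = max(1, num_frames // initial)
    scores: Dict[int, float] = {
        i: score_frame(i) for i in range(0, num_frames, step)[:initial]
    }

    # Refinement: look closer and closer around the best frame so far.
    radius = step // 2
    while len(scores) < budget and radius >= 1:
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            break  # relevant enough; stop spending frames
        for i in (best - radius, best + radius):
            if 0 <= i < num_frames and i not in scores and len(scores) < budget:
                scores[i] = score_frame(i)
        radius //= 2

    return sorted(scores)


if __name__ == "__main__":
    # Toy relevance (assumption): the query is "about" frame 70.
    frames = adaptive_sample(100, lambda i: 1 - abs(i - 70) / 100, threshold=1.0)
    print(frames)
```

In this toy setting the loop inspects a small fraction of the 100 frames before reaching the target frame, which is the kind of saving the abstract attributes to sampling only query-relevant frames. A real agent would replace the numeric score with model-driven decisions about which frames to request next.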

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics