Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning
Sullam Jeoung, Goeric Huybrechts, Bhavana Ganesh, Aram Galstyan, Sravan Bodapati

TL;DR
This paper introduces an adaptive video understanding agent that uses query-driven frame sampling and feedback-based reasoning with large language models to improve efficiency and accuracy in analyzing long videos.
Contribution
It presents a novel agent-based approach combining query-adaptive frame sampling and self-reflective feedback to enhance long-form video understanding with LLMs.
Findings
Achieves state-of-the-art performance on video benchmarks.
Reduces the number of frames sampled, improving efficiency.
Demonstrates effective reasoning and relevance filtering in video analysis.
Abstract
Understanding long-form video content presents significant challenges due to its temporal complexity and the substantial computational resources required. In this work, we propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding by utilizing large language models (LLMs) and their tool-harnessing ability. A key aspect of our method is query-adaptive frame sampling, which leverages the reasoning capabilities of LLMs to process only the most relevant frames in real-time, and addresses an important limitation of existing methods which typically involve sampling redundant or irrelevant frames. To enhance the reasoning abilities of our video-understanding agent, we leverage the self-reflective capabilities of LLMs to provide verbal reinforcement to the agent, which leads to improved performance while minimizing the number of frames…
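The query-adaptive sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function `adaptive_sample`, its parameters (`initial`, `budget`, `threshold`), and the midpoint refinement rule are all assumptions made for illustration. The `score_frame` callable is a hypothetical stand-in for an LLM judging how relevant a frame is to the query, and the self-reflective feedback step is not modeled.

```python
from typing import Callable, Dict


def adaptive_sample(
    num_frames: int,
    score_frame: Callable[[int], float],
    initial: int = 4,
    budget: int = 12,
    threshold: float = 0.9,
) -> Dict[int, float]:
    """Inspect frames until one scores >= threshold or the budget is spent.

    Hypothetical sketch: score_frame stands in for an LLM relevance judgment.
    Returns a mapping from inspected frame index to its relevance score.
    """
    if num_frames <= 0:
        return {}

    # Coarse pass: a few evenly spaced frames across the whole video.
    step = max(1, num_frames // initial)
    seen: Dict[int, float] = {}
    for idx in range(0, num_frames, step):
        if len(seen) >= budget:
            break
        seen[idx] = score_frame(idx)

    # Refinement: look halfway between the best frame and its inspected neighbors.
    while len(seen) < budget and max(seen.values()) < threshold:
        best = max(seen, key=seen.get)
        ordered = sorted(seen)
        pos = ordered.index(best)
        candidates = []
        if pos > 0:
            candidates.append((ordered[pos - 1] + best) // 2)
        if pos < len(ordered) - 1:
            candidates.append((best + ordered[pos + 1]) // 2)
        new = [c for c in candidates if c not in seen]
        if not new:
            break  # Nothing left to refine around the best frame.
        for c in new:
            if len(seen) >= budget:
                break
            seen[c] = score_frame(c)
    return seen


def toy_score(idx: int) -> float:
    """Toy relevance signal (an assumption): peaks at frame 37 of 100."""
    return 1.0 - abs(idx - 37) / 100


inspected = adaptive_sample(100, toy_score, threshold=0.99)
print(sorted(inspected))
```

With this toy scorer the loop inspects only a handful of the 100 frames before it reaches the relevant one, which is the efficiency argument the abstract makes. A real agent would replace `toy_score` with a model call and would likely use a richer rule for choosing the next frames.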
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
