AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning

Urjitkumar Patel; Fang-Chun Yeh; Chinmay Gondhalekar

arXiv:2511.15578 · cs.CV · April 13, 2026

TL;DR

AVATAAR is a modular, interpretable framework that enhances long-form video question answering by combining global and local context with iterative reasoning, leading to significant performance improvements.

Contribution

Introduces AVATAAR, a novel framework integrating global/local context, retrieval, and reasoning modules with feedback for improved video understanding.

Findings

01. Achieves +5.6% in temporal reasoning on CinePile
02. Gains +8% in theme-based questions
03. Feedback loop enhances adaptability and performance

Abstract

As video content becomes more prevalent, understanding and answering questions about long-form videos has become essential for many applications. Although large vision-language models (LVLMs) have improved performance, they often struggle with nuanced queries that demand both comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategy based on partial answers and to replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements…
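The feedback loop described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: every name below (`Clip`, `LoopState`, `pre_retrieval_think`, `retrieve`, `rethink`, `answer_with_feedback`) is hypothetical, retrieval is plain keyword overlap, and the sufficiency check uses a fixed `missing_cue` string where a real system would presumably ask an LVLM to judge the partial answer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Clip:
    start_s: float  # clip start time in seconds
    caption: str    # local description of the clip (assumed to be given)


@dataclass
class LoopState:
    question: str
    global_summary: str  # persistent summary, reused in every round
    feedback: List[str] = field(default_factory=list)


def pre_retrieval_think(state: LoopState) -> Set[str]:
    """Toy stand-in for the Pre Retrieval Thinking Agent: build query keywords
    from the question and feedback, keeping only words the summary mentions."""
    summary_words = set(state.global_summary.lower().split())
    text = " ".join([state.question, *state.feedback]).lower()
    words = (w.strip("?.,") for w in text.split())
    return {w for w in words if len(w) > 3 and w in summary_words}


def retrieve(clips: List[Clip], keywords: Set[str], k: int = 2) -> List[Clip]:
    """Toy local retrieval: rank clips by keyword overlap with their caption."""
    def score(clip: Clip) -> int:
        return len(keywords & set(clip.caption.lower().split()))
    ranked = sorted(clips, key=score, reverse=True)
    return [c for c in ranked[:k] if score(c) > 0]


def rethink(evidence: List[Clip], missing_cue: str) -> Optional[str]:
    """Toy stand-in for the Rethink Module: return None if the evidence looks
    sufficient, otherwise a hint to feed back into the next retrieval round."""
    if any(missing_cue in c.caption.lower() for c in evidence):
        return None
    return missing_cue


def answer_with_feedback(question: str, clips: List[Clip], missing_cue: str,
                         max_rounds: int = 3) -> Tuple[List[Clip], int]:
    """Run think -> retrieve -> rethink until the evidence is judged sufficient."""
    summary = " ".join(c.caption for c in clips)  # crude global summary
    state = LoopState(question, summary)
    evidence: List[Clip] = []
    for round_no in range(1, max_rounds + 1):
        evidence = retrieve(clips, pre_retrieval_think(state))
        hint = rethink(evidence, missing_cue)
        if hint is None:
            return evidence, round_no
        state.feedback.append(hint)  # refine the next retrieval
    return evidence, max_rounds


clips = [
    Clip(0.0, "a woman packs a suitcase at home"),
    Clip(60.0, "she argues with a taxi driver"),
    Clip(120.0, "the train departs without her"),
]
evidence, rounds = answer_with_feedback(
    "Why does the woman miss the departure?", clips, missing_cue="taxi")
print(rounds, [c.start_s for c in evidence])  # prints: 2 [0.0, 60.0]
```

In this toy run the first round retrieves only the opening clip, the rethink step flags the evidence as insufficient, and the fed-back hint lets the second round pull in the clip that matters. The point is only the control flow: a persistent summary plus a retrieval query that changes with feedback.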
