ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Tony Montes, Fernando Lozano

TL;DR
ViQAgent is a zero-shot VideoQA framework that combines Chain-of-Thought reasoning with open-vocabulary grounding validation (using YOLO-World) to improve object tracking and answer accuracy, reporting state-of-the-art results on NExT-QA, iVQA, and ActivityNet-QA.
Contribution
The paper presents a new LLM-based agent that integrates grounding validation with reasoning, setting a new state-of-the-art in zero-shot VideoQA.
Findings
Achieved state-of-the-art results on NExT-QA, iVQA, and ActivityNet-QA.
Enhanced object tracking and grounding validation capabilities.
Improved answer accuracy and reliability across diverse video datasets.
Abstract
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects over time for grounding and in reasoning-based decision-making that better aligns object references with language model outputs, as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our…
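To make the idea of grounding validation concrete, here is a minimal sketch of one way an agent could check that objects named in its reasoning are actually supported by detector output. It is an illustration only, not the authors' implementation: the `Detection` structure, the `validate_grounding` function, and the `min_score` and `min_frames` thresholds are all hypothetical, and the detections are hand-written stand-ins for what an open-vocabulary detector such as YOLO-World might return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One open-vocabulary detection in a single frame (hypothetical structure)."""
    frame: int    # index of the sampled video frame
    label: str    # class name the detector was prompted with
    score: float  # detector confidence in [0, 1]


def validate_grounding(claimed_objects, detections, min_score=0.3, min_frames=2):
    """Mark each claimed object as grounded if it is detected with at least
    `min_score` confidence in at least `min_frames` distinct frames.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    frames_seen = {obj: set() for obj in claimed_objects}
    for det in detections:
        if det.label in frames_seen and det.score >= min_score:
            frames_seen[det.label].add(det.frame)
    return {obj: len(frames) >= min_frames for obj, frames in frames_seen.items()}


# Hand-written detections standing in for real detector output.
detections = [
    Detection(0, "dog", 0.82),
    Detection(1, "dog", 0.77),
    Detection(1, "ball", 0.25),  # below min_score, so it is ignored
    Detection(2, "ball", 0.61),
]

# Objects that a reasoning step claimed were relevant to the question.
result = validate_grounding(["dog", "ball", "frisbee"], detections)
print(result)  # {'dog': True, 'ball': False, 'frisbee': False}
```

In this sketch, an object that fails validation (here "ball" and "frisbee") would be a signal for the agent to revisit its reasoning rather than answer directly. How ViQAgent actually reconciles reasoning and grounding is not specified in the text above.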
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: ALIGN
