ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Tony Montes; Fernando Lozano

arXiv:2505.15928 · cs.CV · May 23, 2025

TL;DR

ViQAgent introduces a novel zero-shot VideoQA framework combining Chain-of-Thought reasoning with open-vocabulary grounding, significantly improving object tracking and answer accuracy across multiple benchmarks.

Contribution

The paper presents a new LLM-based agent that integrates grounding validation with reasoning, setting a new state-of-the-art in zero-shot VideoQA.

Findings

01 Achieved state-of-the-art results on NExT-QA, iVQA, and ActivityNet-QA.
02 Enhanced object tracking and grounding validation capabilities.
03 Improved answer accuracy and reliability across diverse video datasets.

Abstract

Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in two areas: tracking objects over time for grounding, and reasoning-based decision-making that better aligns object references with language model outputs. Newer models continue to get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our…
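The abstract is truncated before it describes the method in detail, so the following is only a rough illustration of what checking a reasoning step against open-vocabulary detections could look like. It is not the authors' implementation. The `Detection` record, both function names, the 0.5 score threshold, and the sample data are all hypothetical, and the detector is replaced by hard-coded output rather than a real YOLO-World call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One open-vocabulary detection (hypothetical record, not a YOLO-World type)."""
    frame: int    # frame index in the video
    label: str    # free-text class name the detector was prompted with
    score: float  # detector confidence in [0, 1]


def grounded_frames(detections, label, min_score=0.5):
    """Return the sorted frame indices where `label` was detected confidently."""
    return sorted(
        {d.frame for d in detections if d.label == label and d.score >= min_score}
    )


def validate_references(claimed, detections, min_score=0.5):
    """Check each claimed object reference against the detections.

    `claimed` maps a label to an inclusive (start_frame, end_frame) range in
    which the reasoning step says the object appears. A reference counts as
    grounded if at least one confident detection falls inside that range.
    """
    report = {}
    for label, (start, end) in claimed.items():
        frames = grounded_frames(detections, label, min_score)
        report[label] = any(start <= f <= end for f in frames)
    return report


# Hypothetical detector output, hard-coded for illustration.
detections = [
    Detection(0, "dog", 0.91),
    Detection(1, "dog", 0.88),
    Detection(2, "ball", 0.35),  # below the threshold, so ignored
    Detection(5, "ball", 0.80),  # confident, but outside the claimed range
]

# Hypothetical object references produced by a reasoning step.
claimed = {"dog": (0, 2), "ball": (0, 3), "cat": (0, 5)}

report = validate_references(claimed, detections)
print(report)
```

In this toy data only the `dog` reference is grounded. An agent could use such a report to decide whether to trust an answer or revisit its reasoning, though how ViQAgent actually does this is not stated in the text shown here.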

Code & Models

Repositories

t-montes/viqagent (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: ALIGN