VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

Yiwen Song; Tomas Pfister; Yale Song

arXiv:2603.12310 · cs.CV · March 16, 2026

TL;DR

VQQA is a multi-agent vision-language framework that improves generated video quality by dynamically generating visual questions and feeding the resulting VLM critiques back as semantic feedback, enabling efficient, closed-loop prompt optimization.

Contribution

The paper presents VQQA, a novel, generalizable multi-agent system that uses visual question answering and critiques for effective video quality enhancement in a black-box setting.

Findings

01. Achieves +11.57% improvement on T2V-CompBench

02. Achieves +8.43% improvement on VBench2

03. Outperforms existing stochastic search and prompt optimization methods

Abstract

Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially…
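
The closed-loop process described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `optimize_prompt`, the `Critique` type, the pass-rate score, and all four `toy_*` stand-ins are hypothetical names and choices made for this example. In a real setting the generator and critic would be a black-box video model and a VLM; here they are replaced by simple string operations so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Critique:
    question: str   # the visual question that was asked
    passed: bool    # whether the video satisfied it
    comment: str    # natural-language feedback used to revise the prompt


def optimize_prompt(
    intent: str,
    generate_video: Callable[[str], Any],
    ask_questions: Callable[[str], List[str]],
    critique_video: Callable[[Any, List[str]], List[Critique]],
    refine_prompt: Callable[[str, List[Critique]], str],
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Hypothetical closed loop: generate, question, critique, refine."""
    current = intent
    best_prompt, best_score = intent, -1.0
    for _ in range(max_rounds):
        video = generate_video(current)        # black-box call, no model internals
        questions = ask_questions(intent)      # questions derive from the user's intent
        critiques = critique_video(video, questions)
        # Assumed score: fraction of questions the video passes.
        score = sum(c.passed for c in critiques) / max(len(critiques), 1)
        if score > best_score:
            best_prompt, best_score = current, score
        failures = [c for c in critiques if not c.passed]
        if not failures:
            break                              # nothing left to fix
        current = refine_prompt(current, failures)
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str) -> set:
    # Pretend the model only "renders" words that the prompt repeats.
    words = prompt.lower().split()
    return {w for w in words if words.count(w) >= 2}


def toy_questions(intent: str) -> List[str]:
    # Each "question" is just a keyword to look for in the video.
    return intent.lower().split()


def toy_critique(video: set, questions: List[str]) -> List[Critique]:
    return [Critique(q, q in video, f"emphasize {q}") for q in questions]


def toy_refine(prompt: str, failures: List[Critique]) -> str:
    # Repeat each failed keyword so the toy generator picks it up.
    return prompt + " " + " ".join(c.question for c in failures)


if __name__ == "__main__":
    print(optimize_prompt("red ball", toy_generate, toy_questions,
                          toy_critique, toy_refine))
```

The loop touches the generator only through its prompt, which is what makes the setting black-box; the critiques play the role the abstract calls semantic gradients by telling the refinement step which aspects of the video to change.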

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning