VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
Yiwen Song, Tomas Pfister, Yale Song

TL;DR
VQQA introduces a multi-agent, vision-language framework that improves video generation quality by generating visual questions and using critiques as semantic feedback, enabling efficient, closed-loop prompt optimization.
Contribution
The paper presents VQQA, a novel, generalizable multi-agent system that uses visual question answering and critiques for effective video quality enhancement in a black-box setting.
Findings
Achieves +11.57% improvement on T2V-CompBench
Achieves +8.43% improvement on VBench2
Outperforms existing stochastic search and prompt optimization methods
Abstract
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially…
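The closed loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`optimize_prompt`, `generate`, `make_questions`, `critique`, `refine`), the `Critique` record, the 0-to-1 scoring scale, and the stopping threshold are all assumptions made for illustration. The video generator and the VLM are passed in as plain callables, so the loop only touches them through a black-box interface, and toy stand-ins are used here so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical record of one VLM answer to one visual question."""
    question: str
    score: float    # assumed scale: 0.0 (fails) to 1.0 (passes)
    feedback: str   # natural-language critique used to revise the prompt


def optimize_prompt(
    prompt: str,
    generate: Callable[[str], object],                  # black-box video generator
    make_questions: Callable[[str, object], List[str]],  # question-generation agent
    critique: Callable[[object, str], Critique],         # VLM answering one question
    refine: Callable[[str, List[Critique]], str],        # prompt-rewriting agent
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, float]:
    """Generate, question, critique, refine; return the best prompt seen."""
    best_prompt, best_score = prompt, float("-inf")
    current = prompt
    for _ in range(max_rounds):
        video = generate(current)
        # Questions are derived from the ORIGINAL intent, not the rewritten prompt.
        questions = make_questions(prompt, video)
        critiques = [critique(video, q) for q in questions]
        score = sum(c.score for c in critiques) / len(critiques) if critiques else 0.0
        if score > best_score:
            best_prompt, best_score = current, score
        if score >= target:
            break
        # Failed questions act as the textual feedback signal for the next round.
        failures = [c for c in critiques if c.score < target]
        current = refine(current, failures)
    return best_prompt, best_score


# Toy stand-ins (assumptions): the "video" is just the lowercased prompt text.
def toy_generate(p: str) -> str:
    return p.lower()


def toy_questions(intent: str, video: object) -> List[str]:
    return ["red", "two"]


def toy_critique(video: object, q: str) -> Critique:
    ok = q in str(video)
    return Critique(q, 1.0 if ok else 0.0, "" if ok else f"emphasize '{q}'")


def toy_refine(p: str, failures: List[Critique]) -> str:
    return p + " " + " ".join(c.feedback for c in failures)


best_prompt, best_score = optimize_prompt(
    "a ball", toy_generate, toy_questions, toy_critique, toy_refine
)
```

In a real setting, each callable would wrap a call to a video model or a VLM; keeping them as injected functions is one way to reflect the black-box, natural-language interface the abstract describes.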
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
