VQA$^2$: Visual Question Answering for Video Quality Assessment
Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min

TL;DR
This paper introduces VQA2, a new dataset and models for video quality assessment using visual question answering, achieving state-of-the-art results and surpassing GPT-4o in understanding tasks.
Contribution
It presents the first visual question answering instruction dataset for video quality assessment and develops models that integrate spatial-temporal perception, advancing low-level video quality understanding.
Findings
The VQA2 dataset contains 157,755 question-answer pairs across various video types.
VQA2-Assistant outperforms GPT-4o in visual quality understanding tasks.
Models achieve strong performance in both video quality scoring and understanding.
Abstract
The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, focused initially on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset - the first visual question answering instruction dataset that focuses on video quality assessment. This dataset consists of 3…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
