Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

TL;DR
Q-Bench-Video is a comprehensive benchmark designed to evaluate large multi-modal models' ability to understand and assess video quality across diverse sources, question types, and distortion categories.
Contribution
This paper introduces Q-Bench-Video, a new benchmark with diverse video sources, question formats, and distortion types to systematically evaluate LMMs' video quality understanding.
Findings
LMMs show basic understanding but lack precision.
Performance of LMMs is significantly below human levels.
The benchmark reveals specific areas for improvement in LMMs.
Abstract
With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal…
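As an illustration of how the multiple-choice part of a benchmark like this could be scored, the sketch below computes accuracy separately for each question type. It is not the authors' evaluation code: the record keys (`type`, `answer`, `prediction`) and the sample records are hypothetical, and exact-match scoring only suits the Yes-or-No and What-How formats. Open-ended answers would need a different judging scheme.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return exact-match accuracy per question type.

    Each record is a dict with the hypothetical keys
    'type', 'answer', and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        qtype = record["type"]
        total[qtype] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Made-up records for illustration only; not taken from the benchmark.
records = [
    {"type": "yes-or-no", "answer": "Yes", "prediction": "yes"},
    {"type": "yes-or-no", "answer": "No", "prediction": "Yes"},
    {"type": "what-how", "answer": "B", "prediction": "B"},
]
scores = accuracy_by_type(records)
print(scores)  # {'yes-or-no': 0.5, 'what-how': 1.0}
```

Reporting accuracy per question type, rather than one pooled number, keeps a weakness on one format from being hidden by strength on another.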
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The methodology for compiling the dataset is described in detail. The experiments are quite extensive: 17 VLM models and several datasets, including AIGC.
Only small LMMs are tested, and there is no indication of how scaling laws affect a model's ability to understand video quality. The length of the model's answer is not taken into account, so a long answer can receive a higher score. Technical details regarding the operation of the LMMs are lacking: how exactly videos were input into each LMM, such as the resolution and the number of frames, is not thoroughly described (see the questions section).
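One way to probe the answer-length concern raised in this review is to check whether scores on open-ended answers correlate with answer length. The sketch below is a hypothetical diagnostic, not something described in the paper: it computes the Pearson correlation between word count and assigned score, and the sample answers and scores are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(answers, scores):
    """Correlate answer length (in whitespace-separated words) with score."""
    lengths = [len(answer.split()) for answer in answers]
    return pearson(lengths, scores)


# Made-up open-ended answers and scores, for illustration only.
answers = ["blurry", "blurry video", "blurry shaky video"]
scores = [0, 1, 2]
r = length_score_correlation(answers, scores)
print(round(r, 3))  # 1.0
```

A strong positive correlation would not prove a length bias on its own, since longer answers may simply be more complete, but it would indicate where a length-controlled comparison is worth running.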
1. Q-Bench-Video is the first benchmark specifically focused on assessing video quality understanding in Large Multi-modal Models (LMMs), addressing a unique and underexplored aspect of LMMs that goes beyond typical video comprehension. By evaluating quality-related distortions—including technical, aesthetic, temporal, and AIGC-specific aspects—it provides a novel and comprehensive perspective on video quality assessment. 2. The benchmark is meticulously designed with diverse video sources (nat
1. While Q-Bench-Video includes diverse video types, there is limited exploration of how LMMs generalize across these domains. Testing LMMs on specific domains, such as medical or surveillance videos, would enhance the benchmark’s relevance by showing how well models handle domain-specific quality variations, which are critical in many real-world applications. 2. The benchmark results indicate that LMMs struggle significantly with open-ended questions, but the paper lacks a detailed breakdown o
1. The benchmark includes a broad range of video types, including natural scenes, AIGC, and CG, enhancing the evaluation’s comprehensiveness. 2. By integrating various question types and quality concerns, it offers a novel framework for assessing different aspects of video quality. 3. The benchmark includes 2,378 question-answer pairs curated by experts, providing a reliable evaluation dataset.
1. Although the benchmark includes annotations reviewed by three additional participants, the performance on human evaluation is only 81.56%. This raises concerns about whether the annotations meet a robust confidence interval and whether there may be inherent biases in the data. The reliance on subjective assessments for video quality (such as aesthetics) could introduce inconsistencies. 2. The technical and aesthetic evaluations in the dataset are annotated by eight experts, but the paper does
- The paper is well-presented, with adequate examples and figures to show the query designs in the benchmark.
- The questions can cover extensive aspects of video quality, including technical, aesthetic, temporal, and AIGC distortions.
- The video pairs comparison is a good task for evaluating LLM's capability, which can be considered as an advantage over existing video benchmarks.
- Motivation of the Benchmark: The paper's motivation for building the benchmark is not sufficiently strong. While the introduction states the importance of video quality for viewer experience and high-quality video generation, what viewers primarily care about is the classification of video quality (e.g., low, medium, high). A well-trained classifier could fulfill this need more directly. Additionally, for assessing video generation quality, designing robust evaluation metrics is more crucial t
1. The problem is motivated well, and the solution is timely given the ever-increasing usage of LMMs in VQA. 2. The choice of real, AIGC, and CG videos, along with the QA pairs, is well thought out. 3. The coverage of the LMMs is extensive.
1. In my opinion, the work’s full impact is not felt in the current form of the paper. Specifically, the key observation of humans > LMMs > random choice is neither surprising nor unexpected. The same can be said of the observation that proprietary models outperform open-source models. In light of this, the study should have been extended to show how the benchmark can help improve LMM performance to reduce the gap with human performance. Without this, a practitioner would be left wanting for the
Taxonomy
Topics: Image and Video Quality Assessment
