Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
Yachun Mi, Yu Li, Yanting Li, Chen Hui, Tong Zhang, Zhixuan Li, Chenyue Song, Wei Yang Bryan Lim, Shaohui Liu

TL;DR
Q-CLIP leverages vision-language models with minimal training to improve video quality assessment, reducing computational costs and enhancing sensitivity to quality variations through novel prompts and sampling strategies.
Contribution
This work introduces the first fully VLM-based VQA framework, Q-CLIP, with a shared cross-modal adapter and quality prompts, addressing computational efficiency and sensitivity issues.
Findings
Q-CLIP achieves state-of-the-art results on multiple VQA datasets.
Frame-difference sampling improves generalization across datasets.
Minimal trainable parameters reduce computational costs significantly.
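The findings mention frame-difference sampling without giving the procedure on this page, so the following is only a minimal sketch of one plausible reading: keep the frames that change most relative to their predecessor. The function name `frame_difference_sample`, the mean-absolute-difference criterion, and the top-k selection rule are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def frame_difference_sample(frames: np.ndarray, num_samples: int) -> np.ndarray:
    """Pick the frames that differ most from the frame before them.

    frames: array of shape (T, H, W) or (T, H, W, C).
    Returns the selected frame indices in temporal order.
    NOTE: hypothetical helper; the selection rule is an assumption.
    """
    t = frames.shape[0]
    if t < 2:
        return np.arange(t)

    # Flatten each frame so the difference is a single number per frame pair.
    flat = frames.reshape(t, -1).astype(np.float64)

    # diffs[i] is the mean absolute change from frame i to frame i + 1.
    diffs = np.abs(flat[1:] - flat[:-1]).mean(axis=1)

    # Take the k largest changes; +1 maps diffs[i] to frame i + 1.
    k = max(0, min(num_samples, t - 1))
    top = np.argsort(-diffs, kind="stable")[:k] + 1

    # Restore temporal order so downstream pooling sees frames in sequence.
    return np.sort(top)
```

A call such as `frame_difference_sample(video, 8)` would return up to eight indices, which can then be used to slice the video before feeding frames to the image encoder.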
Abstract
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment.…
Peer Reviews
Decision·Submitted to ICLR 2026
- The most significant contribution is the method's efficiency. Achieving SOTA results by only finetuning 0.14M parameters is highly compelling. - The method consistently achieves SOTA or competitive performance across a wide range of intra-dataset and cross-dataset benchmarks. The performance gains over other VLM-utilizing methods like CLIPVQA are notable. - The systematic study of frame sampling strategies is a useful, practical contribution. The finding that frame-difference-based (motion-awa
- There is a lack of explicit temporal modeling in Q-CLIP. The ablation study on frame feature fusion (Fig. 7) shows that simple mean pooling of frame features outperforms explicit temporal modeling architectures like Transformers and Mamba. The method appears to be an aggregation of frame-level quality scores. This raises significant doubts about its ability to assess temporal artifacts (e.g., judder, flickering, motion stutter). - The method's performance heavily relies on a *specific* VLM backbone (
1. This work achieves better performance with fewer parameters on VQA datasets. 2. This submission is well-written and easy to follow.
1. The novelty of this work is somewhat weak. It resembles a combination of the techniques proposed in CLIP-IQA (text-vision similarity) and Q-Align (five rating levels), together with common learnable prompts. Besides, the difference-based sampling method is also simple and intuitive, which is more like a baseline method. 2. Compared with the original CLIP, the main revision is the proposed SCMA module. However, the reason why this module works is not well analyzed or supported. 3. In Tab
(1) Clear writing and well-defined motivation. The paper follows a logical structure. Methods are clearly explained using figures and equations, and details of the model and training processes are thoroughly described, which makes reproducibility straightforward. The motivation of the work is also well-defined: it aims to solve two key problems in existing studies. One is that semantic knowledge from classification pretraining cannot capture multi-dimensional quality factors. The other is that lar
(1) In the comparison of fine-tuning methods, some approaches are missing. Examples include CoOp and VPT. Although these methods may not perform well in VQA tasks, adding comparisons with them would make the experimental results more comprehensive. (2) Although efficiency experiments demonstrate that Q-CLIP outperforms most baseline methods, I believe there is still room for optimization in its parameter scale. Attempting to use smaller backbone networks may help achieve a better balance betwee
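The reviews above describe the scoring path in terms of three ingredients: text-vision similarity (as in CLIP-IQA), five rating levels (as in Q-Align), and mean pooling of frame features. A rough sketch of how these could fit together is shown below. The level wording in `QUALITY_LEVELS`, the `temperature` value, the expected-value weighting from 1 to 5, and the function name `video_quality_score` are assumptions for illustration; the feature arrays stand in for encoder outputs, and the paper's adapter and learnable prompts are not modeled here.

```python
import numpy as np

# Assumed wording for the five rating levels; the paper's prompts may differ.
QUALITY_LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def video_quality_score(
    frame_feats: np.ndarray,
    text_feats: np.ndarray,
    temperature: float = 0.01,
) -> float:
    """Map frame and prompt embeddings to a score between 1 and 5.

    frame_feats: (T, D) image-encoder features, one row per sampled frame.
    text_feats:  (5, D) text-encoder features, one row per quality level.
    NOTE: hypothetical helper; the weighting scheme is an assumption.
    """
    # L2-normalize so dot products are cosine similarities.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Mean pooling over frames gives a single video-level feature.
    video = f.mean(axis=0)

    # Similarity to each quality prompt, turned into a distribution via softmax.
    logits = (t @ video) / temperature
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()

    # Expected rating, with level i weighted by i (1 = worst, 5 = best).
    weights = np.arange(1, len(t) + 1)
    return float(probs @ weights)
```

Because the frame features are averaged before scoring, this sketch has no notion of frame order, which is exactly the property the first reviewer questions when asking about temporal artifacts.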
Taxonomy
Topics: Image and Video Quality Assessment · Visual Attention and Saliency Detection · Advanced Data Compression Techniques
