Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision

Linhan Cao; Wei Sun; Kaiwei Zhang; Yicong Peng; Guangtao Zhai; Xiongkuo Min

arXiv:2505.03631 · cs.CV · May 19, 2026

TL;DR

This paper introduces a self-supervised learning framework for video quality assessment that leverages ranking-based methods and iterative self-improvement, enabling high performance without extensive manual annotations.

Contribution

The authors propose a novel self-supervised VQA approach using ranking-based learning and iterative refinement, trained on a large-scale unlabeled dataset, surpassing previous models in generalization and performance.

Findings

01

Achieves zero-shot performance comparable to or better than that of supervised models.

02

Demonstrates strong out-of-distribution generalization across diverse videos.

03

Sets new state-of-the-art results when fine-tuned on labeled datasets.

Abstract

Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, their reliance on manually annotated datasets, a process that is labor-intensive, costly, and difficult to scale up, has hindered further improvement of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA that learns quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic…
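As a rough illustration of the pseudo-labeling route described in the abstract, the sketch below turns scores from existing VQA models into pairwise ranking labels and evaluates a standard pairwise logistic loss. It is not the authors' implementation: the function names, the averaging of teacher scores, the `margin` threshold for discarding ambiguous pairs, and the choice of loss are all assumptions made for the example.

```python
import math
from itertools import combinations


def pseudo_label_pairs(scores, margin=0.1):
    """Build (video_a, video_b, label) triples from teacher scores.

    scores: dict mapping a video id to a list of teacher scores,
            assumed here to be already rescaled to [0, 1].
    label is 1 if video_a is judged better than video_b, else 0.
    Averaging and the margin filter are illustrative assumptions.
    """
    mean = {vid: sum(s) / len(s) for vid, s in scores.items()}
    pairs = []
    for a, b in combinations(sorted(mean), 2):
        diff = mean[a] - mean[b]
        if abs(diff) < margin:
            continue  # teachers do not separate this pair clearly; skip it
        pairs.append((a, b, 1 if diff > 0 else 0))
    return pairs


def pairwise_logistic_loss(pred_a, pred_b, label):
    """Logistic (Bradley-Terry style) loss on the student's score gap."""
    p = 1.0 / (1.0 + math.exp(-(pred_a - pred_b)))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical teacher scores for three videos.
teacher_scores = {
    "v1": [0.9, 0.8],
    "v2": [0.3, 0.4],
    "v3": [0.85, 0.8],
}
pairs = pseudo_label_pairs(teacher_scores)
```

In this toy input, `v1` and `v3` receive nearly identical mean scores, so that pair is dropped and only the two clearly separated pairs are kept for training.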

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 5

Strengths

- The application of weak-to-strong learning to VQA is novel and significant. It presents a scalable approach to move beyond the limitations of traditional supervised learning.
- The paper's primary strength lies in its empirical results. The final model demonstrates substantial performance gains on OOD datasets compared to all baselines, including the SOTA teacher models it learns from. This directly supports the central claim that the proposed method improves generalization.
- The iterative se

Weaknesses

- The framing relies on a "weak" teacher, but the selected teachers are, in fact, the current SOTA VQA models (e.g., DOVER, Q-Align). The work is less about learning from a *weak* model and more about distilling an ensemble of experts into a single, higher-capacity student. This reliance on a suite of pre-trained SOTA models is an important prerequisite.
- The student model is an ~8B parameter LMM, while the teachers are orders of magnitude smaller. It is difficult to disentangle the gains from t

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- Empirically thorough with extensive dataset evaluation and ablations.
- The iterative W2S design is well executed, and the results show consistent gains, especially on OOD datasets.
- The idea of integrating multiple weak teachers (real and synthetic distortion) is practical and potentially impactful for scaling VQA.

Weaknesses

- The core idea of the paper, W2S generalization for VQA, is essentially model distillation, which has been extensively used in the IQA/VQA space, e.g., [1,2,3]. The same goes for the other components: ranking-based regression [4] and iterative self-teaching. There is little conceptual innovation beyond empirical confirmation that W2S helps in VQA, which has been previously explored.
- The selection of exactly five weak models is not justified. Why these models? Were others tried or excluded (e.g.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Systematic Framework Construction: The paper goes beyond simple model ensemble by proposing a comprehensive W2S framework that systematically combines "integration of multiple supervision signals" with "iterative training." This provides a concrete implementation path for exploring annotation-free learning paradigms in VQA.
- Clear Empirical Contribution: The authors provide empirical evidence for the "weak-to-strong effect" in the VQA domain, demonstrating that a strong student model can sur

Weaknesses

- The methodological innovation is insufficient. Specifically, the approach of constructing datasets and using multiple QA models to generate weak labels has been explored in prior works (e.g., HEKE), which the authors fail to cite or discuss. In the context of large models, the novelty is not prominent, as the work largely follows existing ideas without significant breakthroughs.
- The experimental results do not show clear or substantial improvements. The performance gains are only marginal a

Code & Models

Repositories

clh124/LMM-PVQA
github

Models

🤗
kkkkkklinhan/llava_qwen_slowfast_w2s_stage3
model

Taxonomy

Topics: Image and Video Quality Assessment · Advanced Image Processing Techniques · Advanced Computing and Algorithms