TL;DR
This paper introduces a self-supervised learning framework for video quality assessment that leverages ranking-based methods and iterative self-improvement, enabling high performance without extensive manual annotations.
Contribution
The authors propose a novel self-supervised VQA approach using ranking-based learning and iterative refinement, trained on a large-scale unlabeled dataset, surpassing previous models in generalization and performance.
Findings
Achieves zero-shot performance comparable to or better than supervised models.
Demonstrates strong out-of-distribution generalization across diverse videos.
Sets new state-of-the-art results when fine-tuned on labeled datasets.
Abstract
Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, their reliance on manually annotated datasets, which are labor-intensive, costly, and difficult to scale up, has hindered further improvement of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA that learns quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic…
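To make the learning-to-rank idea in the abstract concrete, here is a minimal sketch of how video pairs could be labeled from teacher pseudo-scores and scored with a pairwise logistic loss. This is an illustration under stated assumptions, not the authors' implementation: the function names `make_pairs` and `pairwise_ranking_loss`, the `margin` filter, and the choice of a logistic (Bradley-Terry style) loss are all hypothetical, since the excerpt does not specify them.

```python
import math
from itertools import combinations


def make_pairs(pseudo_scores, margin=0.5):
    """Build (i, j, label) triples from teacher pseudo-scores.

    label is 1.0 when video i is pseudo-labeled as better than video j,
    and 0.0 otherwise. Pairs whose score gap is below `margin` are
    skipped (an assumed filter for unreliable comparisons).
    """
    pairs = []
    for i, j in combinations(range(len(pseudo_scores)), 2):
        gap = pseudo_scores[i] - pseudo_scores[j]
        if abs(gap) < margin:
            continue
        pairs.append((i, j, 1.0 if gap > 0 else 0.0))
    return pairs


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(student_scores, pairs):
    """Mean binary cross-entropy on sigmoid(s_i - s_j) over labeled pairs."""
    if not pairs:
        return 0.0
    total = 0.0
    for i, j, label in pairs:
        diff = student_scores[i] - student_scores[j]
        # -log(sigmoid(diff)) = softplus(-diff); -log(1 - sigmoid(diff)) = softplus(diff)
        total += label * _softplus(-diff) + (1.0 - label) * _softplus(diff)
    return total / len(pairs)


# Hypothetical teacher scores for four videos (higher means better quality).
teacher = [4.2, 2.0, 3.9, 1.0]
pairs = make_pairs(teacher, margin=0.5)

agreeing_student = [3.0, 1.0, 2.8, 0.0]   # same ordering as the teacher
reversed_student = [0.0, 2.8, 1.0, 3.0]   # opposite ordering
loss_agree = pairwise_ranking_loss(agreeing_student, pairs)
loss_reversed = pairwise_ranking_loss(reversed_student, pairs)
```

A student whose score ordering matches the pseudo-labels receives a lower loss than one whose ordering is reversed. In this toy data the pair (video 0, video 2) is dropped because its score gap is below the margin.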
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The application of weak-to-strong learning to VQA is novel and significant. It presents a scalable approach to move beyond the limitations of traditional supervised learning.
- The paper's primary strength lies in its empirical results. The final model demonstrates substantial performance gains on OOD datasets compared to all baselines, including the SOTA teacher models it learns from. This directly supports the central claim that the proposed method improves generalization.
- The iterative se

- The framing relies on a "weak" teacher, but the selected teachers are, in fact, the current SOTA VQA models (e.g., DOVER, Q-Align). The work is less about learning from a *weak* model and more about distilling an ensemble of experts into a single, higher-capacity student. This reliance on a suite of pre-trained SOTA models is an important prerequisite.
- The student model is an ~8B parameter LMM, while the teachers are orders of magnitude smaller. It is difficult to disentangle the gains from t

- Empirically thorough with extensive dataset evaluation and ablations.
- The iterative W2S design is well executed, and the results show consistent gains, especially on OOD datasets.
- The idea of integrating multiple weak teachers (real and synthetic distortion) is practical and potentially impactful for scaling VQA.

- The core idea of the paper, W2S generalization for VQA, is essentially model distillation, which has been used extensively in the IQA/VQA space, e.g., [1,2,3]. The same goes for the other components: ranking-based regression [4] and iterative self-teaching. There is little conceptual innovation beyond empirical confirmation that W2S helps in VQA, which has been previously explored.
- The selection of exactly five weak models is not justified. Why these models? Were others tried or excluded (e.g.

- Systematic Framework Construction: The paper goes beyond a simple model ensemble by proposing a comprehensive W2S framework that systematically combines "integration of multiple supervision signals" with "iterative training." This provides a concrete implementation path for exploring annotation-free learning paradigms in VQA.
- Clear Empirical Contribution: The authors provide empirical evidence for the "weak-to-strong effect" in the VQA domain, demonstrating that a strong student model can sur

- The methodological innovation is insufficient. Specifically, the approach of constructing datasets and using multiple QA models to generate weak labels has been explored in prior works (e.g., HEKE), which the authors fail to cite or discuss. In the context of large models, the novelty is not prominent, as the work largely follows existing ideas without significant breakthroughs.
- The experimental results do not show clear or substantial improvements. The performance gains are only marginal a
Taxonomy
Topics: Image and Video Quality Assessment · Advanced Image Processing Techniques · Advanced Computing and Algorithms
