Context and Pixel Aware Large Language Model for Video Quality Assessment
Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang

TL;DR
This paper introduces CP-LLM, a multimodal large language model that pairs dual vision encoders with a language decoder to improve video quality assessment by capturing both high-level video context and low-level pixel distortions.
Contribution
The paper presents a new architecture, CP-LLM, that independently analyzes high-level context and low-level pixel distortions for better video quality assessment.
Findings
Achieves state-of-the-art performance on VQA benchmarks.
Demonstrates superior robustness to pixel distortions.
Produces both quality scores and interpretable quality descriptions.
Abstract
Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable…
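The dual-branch idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: CP-LLM's encoders are learned networks, whereas the functions below (`context_encoder`, `pixel_encoder`, `fuse_tokens`) are hypothetical hand-written stand-ins, and the language decoder is omitted entirely. The sketch only shows the data flow: two independent views of the same frames, merged into one tagged token sequence that a decoder could reason over.

```python
from typing import List, Sequence, Tuple

# Hypothetical frame type: a grayscale image as rows of pixel intensities.
Frame = List[List[float]]
Token = List[float]


def context_encoder(frames: Sequence[Frame], num_tokens: int = 2) -> List[Token]:
    """Stand-in for the high-level branch: one coarse token per sampled frame.

    Assumption: global mean intensity plays the role of a context feature.
    """
    step = max(1, len(frames) // num_tokens)
    sampled = list(frames)[::step][:num_tokens]
    return [[sum(map(sum, f)) / (len(f) * len(f[0]))] for f in sampled]


def pixel_encoder(frames: Sequence[Frame]) -> List[Token]:
    """Stand-in for the low-level branch: a per-frame local-difference statistic.

    Assumption: mean absolute horizontal gradient plays the role of a
    distortion-sensitive feature. Frames must be at least 2 pixels wide.
    """
    tokens = []
    for f in frames:
        diffs = [abs(row[i + 1] - row[i]) for row in f for i in range(len(row) - 1)]
        tokens.append([sum(diffs) / len(diffs)])
    return tokens


def fuse_tokens(
    context_tokens: List[Token], pixel_tokens: List[Token]
) -> List[Tuple[str, Token]]:
    """Concatenate both streams, tagging each token with its source branch."""
    return [("context", t) for t in context_tokens] + [("pixel", t) for t in pixel_tokens]


# Two tiny 2x2 frames: one with strong local contrast, one completely flat.
frames = [[[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]]
fused = fuse_tokens(context_encoder(frames), pixel_encoder(frames))
```

Both toy frames have the same mean intensity, so the context branch cannot tell them apart, while the pixel branch separates them. That is the motivation for keeping the two branches independent before any joint reasoning.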
