COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework

Xin Dong; Sen Jia; Ming Rui Wang; Yan Li; Zhenheng Yang; Bingfeng Deng; Hongyu Xiong

arXiv:2412.10435·cs.CV·July 4, 2025

TL;DR

COEF-VQ is a cascaded multimodal large language model framework for video quality understanding on short-video platforms. An entropy-based pre-filtering stage reduces GPU usage while maintaining high accuracy.

Contribution

The paper introduces COEF-VQ, a novel framework combining lightweight pre-filtering with deep MLLMs to optimize computational efficiency in video quality classification.

Findings

01

Reduces GPU usage significantly while maintaining classification accuracy.

02

Decreases inappropriate content views by 9.9% in online tests.

03

Sustains performance improvements in real-world deployment.

Abstract

With the emergence of Multimodal Large Language Model (MLLM) technology, it has become possible to exploit its video understanding capability on different classification tasks. In practice, however, deploying MLLMs online requires a large amount of GPU resources. In this paper, we propose COEF-VQ, a novel cascaded MLLM framework designed to enhance video quality understanding on a short-video platform while optimizing computational efficiency. Our approach integrates an entropy-based pre-filtering stage, in which a lightweight model assesses uncertainty and selectively filters cases before passing them to the more computationally intensive MLLM for final evaluation. By prioritizing high-uncertainty samples for deeper analysis, our framework significantly reduces GPU usage while maintaining the strong classification performance of a full MLLM…
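
The entropy-based routing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of Shannon entropy in nats, the fixed threshold of 0.5, and the callable standing in for the MLLM are all assumptions made for illustration. It only shows the general idea of letting a lightweight model decide confident cases and escalating uncertain ones.

```python
import math
from typing import Callable, Sequence, Tuple


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution.

    Terms with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cascade_predict(
    light_probs: Sequence[float],
    heavy_model: Callable[[], int],
    threshold: float = 0.5,  # hypothetical value, would be tuned in practice
) -> Tuple[int, bool]:
    """Route one sample through a two-stage cascade.

    light_probs: class probabilities from the lightweight model.
    heavy_model: hypothetical stand-in for the expensive MLLM call.
    Returns (predicted_label, used_heavy_model).
    """
    if shannon_entropy(light_probs) > threshold:
        # High uncertainty: escalate to the expensive model.
        return heavy_model(), True
    # Low uncertainty: accept the lightweight model's top class.
    label = max(range(len(light_probs)), key=lambda i: light_probs[i])
    return label, False


# A confident prediction stays in the lightweight stage,
# while a near-uniform one is escalated.
confident = cascade_predict([0.99, 0.01], heavy_model=lambda: 1)
uncertain = cascade_predict([0.5, 0.5], heavy_model=lambda: 1)
```

In a sketch like this, the threshold controls the trade-off: raising it sends fewer samples to the expensive stage and saves compute, at the risk of accepting more of the lightweight model's mistakes.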

Taxonomy

Topics: Image and Video Quality Assessment · Image and Signal Denoising Methods · Advanced Image Processing Techniques