CLIPVQA: Video Quality Assessment via CLIP

Fengchuang Xing, Mingjie Li, Yuan-Gen Wang, Guopu Zhu, and Xiaochun Cao

arXiv:2407.04928·cs.CV·July 9, 2024

TL;DR

CLIPVQA introduces a novel CLIP-based Transformer approach for video quality assessment, leveraging rich spatiotemporal features and language descriptions to achieve state-of-the-art performance and improved generalizability across diverse datasets.

Contribution

The paper presents a new CLIP-based Transformer framework for VQA that effectively integrates spatiotemporal features and language descriptions, outperforming existing methods.

Findings

01 Achieves state-of-the-art VQA performance on eight datasets.

02 Up to 37% better generalizability than benchmark methods.

03 Validates effectiveness through comprehensive ablation studies.

Abstract

In learning vision-language representations from web-scale data, the contrastive language-image pre-training (CLIP) mechanism has demonstrated remarkable performance in many vision tasks. However, its application to the widely studied video quality assessment (VQA) task is still an open issue. In this paper, we propose an efficient and effective CLIP-based Transformer method for the VQA problem (CLIPVQA). Specifically, we first design an effective video frame perception paradigm with the goal of extracting the rich spatiotemporal quality and content information among video frames. Then, the spatiotemporal quality features are integrated using a self-attention mechanism to yield a video-level quality representation. To utilize the quality language descriptions of videos for supervision, we develop a CLIP-based encoder for language embedding, which is then fully…
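The two steps the abstract names, self-attention over frame features and a language embedding of quality descriptions, can be illustrated with a toy sketch. This is not the authors' implementation: the feature vectors are made up, the attention uses a single head with identity projections, and the helper names (`self_attention_pool`, `quality_distribution`) are hypothetical. The abstract is cut off before it says how the language embedding is used, so the cosine-similarity comparison and the temperature of 0.07 below are assumptions borrowed from generic CLIP-style matching.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention_pool(frames):
    """Single-head self-attention over frame features (identity Q/K/V
    projections, an assumption), followed by a mean over frames to get
    one video-level vector."""
    d = len(frames[0])
    attended = []
    for q in frames:
        # Scaled dot-product attention weights of this frame over all frames.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in frames])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)]
        )
    n = len(attended)
    return [sum(row[i] for row in attended) / n for i in range(d)]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def quality_distribution(video_vec, text_vecs, temperature=0.07):
    """Softmax over cosine similarities between the video vector and
    text embeddings (generic CLIP-style matching, assumed here)."""
    return softmax([cosine(video_vec, t) / temperature for t in text_vecs])


# Toy per-frame features and toy embeddings of quality descriptions.
frames = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.1], [0.8, 0.0, 0.3]]
text_vecs = {
    "bad": [0.0, 1.0, 0.0],
    "fair": [0.5, 0.5, 0.0],
    "good": [1.0, 0.0, 0.0],
}

video_vec = self_attention_pool(frames)
probs = quality_distribution(video_vec, list(text_vecs.values()))
for label, p in zip(text_vecs, probs):
    print(label, round(p, 4))
```

In a real system the frame features and text embeddings would come from trained encoders, and the attention layer would have learned projections and multiple heads; the sketch only shows the data flow from frame features to a distribution over quality descriptions.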

Code & Models

Repositories

GZHU-DVL/CLIPVQA (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Data Compression Techniques

Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Concatenated Skip Connection · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer