Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment
Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang

TL;DR
This paper introduces MSA-VQA, a hierarchical, semantic-aware model leveraging CLIP for assessing AI-generated video quality, achieving state-of-the-art performance through multi-level analysis and semantic supervision.
Contribution
The paper presents a novel multilevel framework with semantic supervision and mutation-aware modules specifically designed for AI-generated video quality assessment.
Findings
Achieves state-of-the-art results on video quality benchmarks.
Effectively captures semantic consistency and subtle frame variations.
Demonstrates robustness across different AI-generated video datasets.
Abstract
The rapid development of diffusion models has greatly advanced AI-generated videos in terms of length and consistency, yet assessing their quality remains challenging. Previous approaches have often focused on User-Generated Content (UGC), but few have targeted AI-generated video quality assessment. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the text encoder of CLIP to ensure semantic consistency between videos and their conditional prompts. Additionally, we propose a Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments…
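As a rough illustration of the ideas named in the abstract, the sketch below scores prompt consistency and frame-to-frame change on toy embeddings, then pools frame scores into segment and video scores. It is not the authors' implementation: the function names (`prompt_consistency`, `frame_mutation`, `multilevel_summary`), the use of plain cosine similarity, the mean pooling, and the two-dimensional toy vectors are all assumptions made for illustration. In practice the embeddings would come from CLIP's image and text encoders, and MSA-VQA's cross-attention modules are not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prompt_consistency(frame_embs, text_emb):
    """Frame level (assumed): similarity of each frame embedding to the prompt embedding."""
    return [cosine(f, text_emb) for f in frame_embs]


def frame_mutation(frame_embs):
    """Assumed proxy for abrupt change: cosine distance between adjacent frames."""
    return [1.0 - cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]


def multilevel_summary(frame_scores, segment_len):
    """Pool frame scores into segment means, then average segments into a video score."""
    segments = [
        frame_scores[i:i + segment_len]
        for i in range(0, len(frame_scores), segment_len)
    ]
    segment_scores = [sum(s) / len(s) for s in segments]
    video_score = sum(segment_scores) / len(segment_scores)
    return segment_scores, video_score


# Toy data (hypothetical): the third frame drifts away from the prompt.
text_emb = [1.0, 0.0]
frame_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

frame_scores = prompt_consistency(frame_embs, text_emb)
mutations = frame_mutation(frame_embs)
segment_scores, video_score = multilevel_summary(frame_scores, segment_len=2)
```

On this toy input, the off-prompt third frame lowers the score of the second segment and produces large adjacent-frame distances on both sides of it, which is the kind of signal a frame-, segment-, and video-level analysis would be expected to surface.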
Taxonomy
Topics: Image and Video Quality Assessment · Visual Attention and Saliency Detection
Methods: Diffusion · Contrastive Language-Image Pre-training
