TL;DR
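This paper introduces FGSVQA, an end-to-end framework for short-form video quality assessment that combines a CLIP-based dense visual encoder with frequency-domain compression priors, reaching an SRCC of 0.736 and a PLCC of 0.787 on short-form video datasets while keeping inference efficient.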
Contribution
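The paper presents a frequency-guided, structure-aware VQA framework for the mixed distortions found in user-generated short-form video. It decomposes features into artifact, structure, and original visual branches and fuses them over time with a learned gating module.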
Findings
Achieves SRCC of 0.736 and PLCC of 0.787 on short-form video datasets.
Uses compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation.
Maintains efficient inference runtime while delivering accurate quality predictions.
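For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) can be computed between predicted scores and subjective ratings. It is not the paper's evaluation code. The function names and the toy data are assumptions for illustration, and the nonlinear score mapping that many VQA evaluations apply before computing PLCC is omitted here.

```python
import numpy as np

def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(rank(x), rank(y))

# Toy data (hypothetical): predictions are monotone in the ratings but not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
ratings = [1.0, 4.0, 9.0, 16.0, 25.0]
print("SRCC:", srcc(predicted, ratings))
print("PLCC:", plcc(predicted, ratings))
```

In this toy case the ordering is preserved exactly, so SRCC is 1, while PLCC falls below 1 because the relationship is not linear. This is why the two numbers are usually reported together.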
Abstract
Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining…
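The aggregation and fusion steps described in the abstract can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the tensor shapes, the function names `weighted_pool` and `gated_fusion`, the softmax gate, and the random placeholder weight maps are all hypothetical. In the paper the weight maps come from frequency-domain compression priors, which are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_pool(features, weight_map):
    """Pool dense patch features (T, N, D) with a non-negative map (T, N)."""
    w = weight_map / (weight_map.sum(axis=1, keepdims=True) + 1e-8)
    return (w[..., None] * features).sum(axis=1)  # (T, D)

def gated_fusion(artifact, structure, original, gate_w, gate_b):
    """Fuse three (T, D) branches per frame with a softmax gate (assumed form)."""
    stacked = np.stack([artifact, structure, original], axis=1)       # (T, 3, D)
    concat = np.concatenate([artifact, structure, original], axis=-1) # (T, 3D)
    gates = softmax(concat @ gate_w + gate_b, axis=-1)                # (T, 3)
    fused = (gates[..., None] * stacked).sum(axis=1)                  # (T, D)
    return fused, gates

# Hypothetical sizes: T frames, N patches per frame, D feature channels.
T, N, D = 4, 16, 8
rng = np.random.default_rng(0)
dense = rng.normal(size=(T, N, D))   # stand-in for dense encoder features
artifact_map = rng.random((T, N))    # placeholder for an artifact-aware map
structure_map = rng.random((T, N))   # placeholder for a structure-aware map

artifact = weighted_pool(dense, artifact_map)
structure = weighted_pool(dense, structure_map)
original = weighted_pool(dense, np.ones((T, N)))  # uniform map = plain mean

gate_w = rng.normal(scale=0.1, size=(3 * D, 3))   # would be learned in practice
gate_b = np.zeros(3)
fused, gates = gated_fusion(artifact, structure, original, gate_w, gate_b)
print(fused.shape, gates.sum(axis=-1))
```

The gate produces one weight per branch per frame, so the mix of artifact, structure, and original features can vary over time. With all gate parameters set to zero the fusion reduces to a plain average of the three branches.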
