VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models

Yunhao Li; Sijing Wu; Zhilin Gao; Zicheng Zhang; Qi Jia; Huiyu Duan; Xiongkuo Min; Guangtao Zhai

arXiv:2601.21915·cs.CV·February 3, 2026

TL;DR

VideoAesBench is a comprehensive benchmark designed to evaluate large multimodal models' understanding of video aesthetic quality across diverse content, question formats, and aesthetic dimensions, revealing current models' limited capabilities.

Contribution

The paper introduces VideoAesBench, a novel, diverse, and multi-faceted benchmark for assessing LMMs' video aesthetic perception, filling a gap in evaluation tools.

Findings

01

Current LMMs show limited video aesthetic perception abilities.

02

Model performance on video aesthetics remains incomplete and imprecise.

03

VideoAesBench provides a new platform for explainable video aesthetics assessment.

Abstract

Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which in turn has made evaluating LMMs important. However, video aesthetic quality assessment, a fundamental human ability, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several notable characteristics: (1) Diverse content: 1,804 videos drawn from multiple sources, including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats: traditional single-choice, multi-choice, and True-or-False questions, plus a novel open-ended question format for video aesthetics description. (3) Holistic video aesthetics…
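The abstract names four question formats. As a hedged illustration only, the sketch below shows one way such benchmark items could be represented and how the three closed-form formats might be scored by exact match. Open-ended answers are skipped because they presumably need a separate (human or model-based) judgment. The names `QuestionFormat`, `BenchItem`, `normalize`, `score_item`, and `accuracy`, the normalization rule, and exact-match scoring are all hypothetical assumptions, not the paper's released data format or evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List, Optional


class QuestionFormat(Enum):
    # The four formats named in the abstract (identifiers are hypothetical).
    SINGLE_CHOICE = "single_choice"
    MULTI_CHOICE = "multi_choice"
    TRUE_FALSE = "true_false"
    OPEN_ENDED = "open_ended"


@dataclass(frozen=True)
class BenchItem:
    # Hypothetical item schema, not the benchmark's actual file format.
    video_id: str
    source: str  # e.g. "UGC", "AIGC", "compressed", "RGC", "game"
    fmt: QuestionFormat
    answer: FrozenSet[str]  # gold option letters or {"True"}/{"False"}; empty for open-ended


def normalize(pred: str) -> FrozenSet[str]:
    # Naive assumption: split on commas/whitespace, drop periods and parentheses,
    # uppercase, so "C, A" -> {"C", "A"} and "true" -> {"TRUE"}.
    tokens = pred.replace(",", " ").split()
    return frozenset(t.strip(" .()").upper() for t in tokens if t.strip(" .()"))


def score_item(item: BenchItem, prediction: str) -> Optional[float]:
    # Open-ended answers are not exact-matchable; return None to skip them.
    if item.fmt is QuestionFormat.OPEN_ENDED:
        return None
    gold = frozenset(a.upper() for a in item.answer)
    # Exact set match: a multi-choice answer counts only if every option is right.
    return 1.0 if normalize(prediction) == gold else 0.0


def accuracy(items: List[BenchItem], predictions: List[str]) -> Dict[str, float]:
    # Mean exact-match accuracy per closed-form question format.
    per_fmt: Dict[str, List[float]] = {}
    for item, pred in zip(items, predictions):
        s = score_item(item, pred)
        if s is None:
            continue
        per_fmt.setdefault(item.fmt.value, []).append(s)
    return {k: sum(v) / len(v) for k, v in per_fmt.items()}


items = [
    BenchItem("v1", "UGC", QuestionFormat.SINGLE_CHOICE, frozenset({"B"})),
    BenchItem("v2", "AIGC", QuestionFormat.MULTI_CHOICE, frozenset({"A", "C"})),
    BenchItem("v3", "game", QuestionFormat.TRUE_FALSE, frozenset({"True"})),
    BenchItem("v4", "RGC", QuestionFormat.OPEN_ENDED, frozenset()),
]
predictions = ["B", "C, A", "false", "Warm lighting and a balanced composition."]
print(accuracy(items, predictions))
```

Exact set matching is a strict choice for multi-choice questions; a partial-credit rule would be an equally plausible alternative, and nothing here reflects what VideoAesBench actually uses.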

Taxonomy

Topics: Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis