Engagement Prediction of Short Videos with Large Multimodal Models

Wei Sun; Linhan Cao; Yuqin Cao; Weixia Zhang; Wen Wen; Kaiwei Zhang; Zijian Chen; Fangfang Lu; Xiongkuo Min; Guangtao Zhai

arXiv:2508.02516 · cs.CV · August 12, 2025

TL;DR

This paper explores large multimodal models for predicting engagement with short videos, demonstrating their effectiveness and taking first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge.

Contribution

The paper empirically evaluates two large multimodal models, VideoLLaMA2 and Qwen2.5-VL, for video engagement prediction, highlighting the importance of audio features and of ensembling.
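
As a rough illustration of ensembling, the sketch below blends per-video engagement scores from two models with a weighted average. This page does not describe the paper's actual ensembling scheme, so the weighted average itself, the function name `ensemble_scores`, and the `weight_a` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence


def ensemble_scores(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    weight_a: float = 0.5,
) -> List[float]:
    """Blend two models' per-video scores with a weighted average.

    Hypothetical helper: the weighting scheme is an assumption,
    not the method reported in the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same set of videos")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(scores_a, scores_b)]


if __name__ == "__main__":
    # Made-up scores for two videos, standing in for the outputs of
    # an audio-visual model and a visual-only model.
    audio_visual = [0.5, 1.0]
    visual_only = [0.25, 0.5]
    print(ensemble_scores(audio_visual, visual_only, weight_a=0.5))
```

In practice the weight would be tuned on a held-out validation split rather than fixed at 0.5.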

Findings

01. VideoLLaMA2 outperforms Qwen2.5-VL in engagement prediction.

02. Including audio features improves model performance.

03. The method achieved first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge.

Abstract

The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, and background sound, but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Music and Audio Processing