eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

Xuecheng Wu; Dingkang Yang; Danlei Huang; Xinyi Yin; Yifan Wang; Jia Zhang; Jiayu Nie; Liangyu Fu; Yang Liu; Junxiao Xue; Hadi Amirpour; Wei Zhou

arXiv:2508.06902 · cs.CV · August 12, 2025

Open Access

TL;DR

This paper introduces eMotions, a large-scale dataset for emotion analysis in short-form videos, and proposes AV-CANet, an audio-visual fusion network that combines video transformers with a Local-Global Fusion Module and an EP-CE Loss to improve emotion recognition accuracy.

Contribution

The paper provides the first large-scale, well-annotated dataset for short-form video emotion analysis and introduces a novel audio-visual fusion network with specialized modules for better multimodal feature integration.

Findings

01. AV-CANet outperforms existing methods on multiple datasets.

02. The Local-Global Fusion Module enhances audio-visual correlation modeling.

03. EP-CE Loss improves global optimization of emotion features.

Abstract

Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SV emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which…
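The abstract mentions a category-balanced variant built through targeted sampling but does not describe the procedure here. As a rough illustration only, the sketch below assumes one simple strategy: downsample every emotion class to the size of the rarest class. The function name `balanced_subset`, its parameters, and the toy labels are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_subset(labels, per_class=None, seed=0):
    """Return sorted sample indices with an equal count per class.

    Assumption: balancing is done by random downsampling, which may differ
    from the procedure actually used to build the dataset variants.
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Never ask for more samples than the rarest class can supply.
    smallest = min(len(idxs) for idxs in by_class.values())
    k = smallest if per_class is None else min(per_class, smallest)

    # A seeded generator keeps the selection reproducible.
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], k))
    return sorted(chosen)


# Toy labels for illustration; not drawn from eMotions.
toy_labels = ["joy"] * 5 + ["sadness"] * 3 + ["anger"] * 2
subset = balanced_subset(toy_labels)
```

With these toy labels the rarest class has two samples, so the subset keeps two indices per class. Downsampling discards data from the larger classes, which is one reason a released balanced variant might be built differently.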

Taxonomy

Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization