EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations
Haoqin Sun, Xuechen Wang, Jinghua Zhao, Shiwan Zhao, Jiaming Zhou, Hui Wang, Jiabei He, Aobo Kong, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

TL;DR
EmotionTalk is a comprehensive Chinese multimodal emotion dataset with rich annotations, including acoustic, visual, and textual data, designed to advance emotion recognition research in Chinese dialogue contexts.
Contribution
This work introduces the first high-quality, versatile Chinese dialogue multimodal emotion dataset with extensive annotations for emotion, sentiment, and speech captioning.
Findings
Demonstrates the dataset's effectiveness in emotion recognition tasks
Provides a new resource for cross-cultural emotion analysis
Facilitates research on multimodal and missing modality challenges
Abstract
In recent years, emotion recognition has played a critical role in applications such as human-computer interaction, mental health monitoring, and sentiment analysis. While datasets for emotion analysis in languages such as English have proliferated, there remains a pressing need for high-quality, comprehensive datasets tailored to the unique linguistic, cultural, and multimodal characteristics of Chinese. In this work, we propose EmotionTalk, an interactive Chinese multimodal emotion dataset with rich annotations. This dataset provides multimodal information from 19 actors participating in dyadic conversational settings, incorporating acoustic, visual, and textual modalities. It includes 23.6 hours of speech (19,250 utterances), annotations for 7 utterance-level emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral), 5-dimensional sentiment labels (negative,…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1) The use of professional actors, theme-driven improvisation, and a controlled recording environment produces more naturalistic expressions than scraping TV/movies. 2) Separate modality-specific labels, confidence scores, and Fleiss' Kappa scores demonstrate the seriousness of the annotation effort across modalities. 3) The evaluation of encoders and fusion methods provides a useful point of reference for future work.
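For readers unfamiliar with the agreement statistic cited in the review above, here is a minimal sketch of how Fleiss' Kappa can be computed for categorical labels. It assumes every item is rated by the same number of raters; the function name `fleiss_kappa` and the toy labels are illustrative only and are not taken from the paper or its annotation pipeline.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels
    (one label per rater). Assumes the same number of raters per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: fraction of rater pairs that chose the same label.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)

    # Undefined (division by zero) if every rating falls in a single category.
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical toy data: two utterances, three raters each.
toy = [["happy", "happy", "sad"], ["sad", "sad", "happy"]]
print(fleiss_kappa(toy))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement below chance.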
1) DeepSeek-R1 generates emotional captions for speech and presents them as authoritative annotations. However, the process of human verification is unclear, as are the methods used to agree on the quality of captions, verify semantic accuracy, and research preferences. This raises a significant risk of LLM hallucinations or stylistic artifacts appearing in the corpus. 2) Despite the severe class imbalance, the evaluation relies heavily on accuracy. 3) Although the paper highlights cross-modal l
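The second concern above (accuracy under class imbalance) can be illustrated with a small sketch. The labels below are made-up toy data, not EmotionTalk statistics, and the helper names `accuracy` and `macro_f1` are hypothetical; the point is only that a classifier that always predicts the majority class can score high accuracy while macro-averaged F1 exposes the failure on the minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical imbalanced toy set: 9 "neutral" utterances and 1 "fear".
y_true = ["neutral"] * 9 + ["fear"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

print(accuracy(y_true, y_pred))  # 0.9
print(macro_f1(y_true, y_pred))  # about 0.47 (9/19)
```

Reporting macro-F1 or per-class recall alongside accuracy is one common way to address this kind of concern.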
The results of the experiment are promising.
1. The authors emphasize the dataset's novelty by highlighting its high-quality recordings conducted by professional actors, improvisational dialogue design, and comprehensive multimodal annotations across text, audio, and video. However, the paper could further clarify how the novelty of EmotionTalk compares to existing datasets beyond general claims. For instance, while the improvisational approach is presented as a novel methodology to enhance emotional authenticity, the paper lacks quantitat
1. High-quality and novel resource: it addresses the severe lack of large-scale Chinese multimodal datasets for emotion analysis. 2. Rich and well-structured annotation scheme: its multi-level labeling (discrete, continuous, and descriptive captions) enables diverse tasks beyond classification. 3. Comprehensive experiments and baselines: it evaluates over 20 models (speech, vision, and text) and multiple fusion methods (TFN, MISA, LMF, etc.).
1. Overemphasis on dataset construction, limited methodological novelty: the work mainly contributes a dataset; experimental analysis (fusion and recognition models) is mostly confirmatory and relies on existing architectures. 2. Lack of comparison to other Chinese datasets: while CH-SIMS, M3ED, and MC-EIUch are mentioned, quantitative comparisons (data quality, annotation agreement, or inter-modal correlation) are missing. 3. Emotion captioning evaluation lacks human judgment: The automatic m
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Mental Health via Writing
