MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling
Yifan Cheng, Ruoyi Zhang, Jiatong Shi

TL;DR
MIKU-PAL is an automated multimodal pipeline that extracts consistent emotional speech labels from unlabeled videos, enabling large-scale emotional speech dataset creation at human-level accuracy and at lower cost and higher speed than manual annotation.
Contribution
The paper introduces MIKU-PAL, a fully automated multimodal system for emotional speech labeling that reaches human-level accuracy while being more consistent, cheaper, and faster than manual annotation.
Findings
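Achieves 68.5% accuracy on the MELD dataset, comparable to human annotators.
Attains a Fleiss' kappa of 0.93 for annotation consistency.
Annotates up to 26 fine-grained emotion categories, validated by human annotators with an 83% rationality rating.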
Abstract
Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (0.93 Fleiss kappa score) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate fine-grained speech emotion categories of up to 26 types, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further released a fine-grained emotional speech dataset…
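The consistency figure above is a Fleiss' kappa, which measures agreement among a fixed number of raters beyond what chance would produce. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name, the emotion labels, and the toy ratings are hypothetical, and it assumes every item is labeled by the same number of raters (which could be human annotators or repeated runs of an automated labeler).

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Hypothetical helper for illustration; assumes the same number of raters
    (at least two) for every item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_chance = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_chance == 1.0:
        # Only one category was ever used; kappa is undefined in this case.
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example with made-up labels: three raters, two clips, full agreement.
labels = ["joy", "anger"]
perfect = [["joy", "joy", "joy"], ["anger", "anger", "anger"]]
print(fleiss_kappa(perfect, labels))
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, so a score of 0.93 corresponds to labels that are almost always identical across raters.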
