MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

Yifan Cheng; Ruoyi Zhang; Jiatong Shi

arXiv:2505.15772 · cs.SD · October 1, 2025

TL;DR

MIKU-PAL is an automated multimodal pipeline that extracts consistent, high-quality emotional speech labels from unlabeled videos. It enables large-scale emotional speech dataset creation with human-level accuracy while being cheaper and faster than human annotation.

Contribution

The paper introduces MIKU-PAL, a fully automated multimodal system for emotional speech labeling that matches human-level accuracy and exceeds manual annotation in consistency.

Findings

01. Achieves 68.5% accuracy on the MELD dataset.

02. Attains a Fleiss' kappa score of 0.93 for annotation consistency.

03. Annotates up to 26 fine-grained emotion categories.

Abstract

Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (0.93 Fleiss kappa score) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate fine-grained speech emotion categories of up to 26 types, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further released a fine-grained emotional speech dataset…
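
The consistency figure above is reported as a Fleiss' kappa score. As a minimal sketch of how that statistic is computed in general (not the authors' evaluation code), the function below takes a table of rating counts. The function name `fleiss_kappa` and the input layout are assumptions made for illustration, as is the idea that each "rater" could be either a human annotator or one repeated run of an automated labeler.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Hypothetical helper for illustration; every item must receive the
    same number of ratings (at least two).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy data: 4 clips, 3 raters, 2 emotion categories, full agreement.
    print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```

A value of 1 means the raters always agree, and values near 0 mean agreement is no better than chance, so a score of 0.93 indicates that labels for the same clip rarely differ between raters.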
