MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers
Kyeongman Park, Seongho Joo, Kyomin Jung

TL;DR
MultiActor-Audiobook is a zero-shot system that generates expressive, speaker-consistent audiobooks using novel multimodal and language model-based processes, eliminating the need for manual configuration or costly training.
Contribution
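Introduces two processes, MSP (Multimodal Speaker Persona Generation) and LSI (LLM-based Script Instruction Generation), that enable zero-shot, expressive audiobook generation with consistent speaker prosody, without the manual configuration or training that prior methods depend on.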
Findings
Achieves results competitive with commercial products in human and MLLM evaluations
Demonstrates effective emotion and prosody control without additional training
Validates the MSP and LSI processes through ablation studies
Abstract
We introduce MultiActor-Audiobook, a zero-shot approach for generating audiobooks that automatically produces consistent, expressive, and speaker-appropriate prosody, including intonation and emotion. Previous audiobook systems have several limitations: they require users to manually configure the speaker's prosody, read each sentence in a more monotonous tone than voice actors would, or rely on costly training. MultiActor-Audiobook addresses these issues by introducing two novel processes: (1) MSP (**Multimodal Speaker Persona Generation**) and (2) LSI (**LLM-based Script Instruction Generation**). With these two processes, MultiActor-Audiobook can generate more emotionally expressive audiobooks with consistent speaker prosody and without additional training. We compare our system with commercial products through human and MLLM evaluations and achieve competitive results.…
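The two-process design described in the abstract can be pictured as a small data flow: a persona is fixed once per speaker (the role MSP plays), and each sentence then receives its own reading instruction (the role LSI plays). The sketch below is only an illustration under stated assumptions. Every name in it (`SpeakerPersona`, `ScriptLine`, `build_personas`, `annotate_script`) is hypothetical, and the placeholder strings stand in for the multimodal and LLM steps, which this page does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerPersona:
    """Hypothetical output of an MSP-like step: one fixed persona per speaker."""
    name: str
    description: str


@dataclass(frozen=True)
class ScriptLine:
    """Hypothetical output of an LSI-like step: a sentence plus how to read it."""
    speaker: str
    text: str
    instruction: str


def build_personas(speakers):
    # Placeholder for MSP: a real system would derive each persona from the book.
    return {s: SpeakerPersona(name=s, description=f"persona for {s}") for s in speakers}


def annotate_script(lines, personas):
    # Placeholder for LSI: a real system would ask an LLM for each instruction.
    script = []
    for speaker, text in lines:
        persona = personas[speaker]  # the same persona is reused for every line
        instruction = f"Read as {persona.description}"
        script.append(ScriptLine(speaker=speaker, text=text, instruction=instruction))
    return script


lines = [("Narrator", "The door creaked open."), ("Alice", "Who is there?")]
personas = build_personas({speaker for speaker, _ in lines})
script = annotate_script(lines, personas)
```

Building the persona once and looking it up for each sentence is what keeps a speaker consistent across the book in this sketch, while the per-line instruction is where sentence-level expressiveness would enter.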
Taxonomy
Topics: Topic Modeling · AI in Service Interactions · Multimodal Machine Learning Applications
