A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration
Shaja Arul Selvamani, Nia D'Souza Ganapathy

TL;DR
This paper presents an AI multi-agent framework that combines neural narration, spatial audio, and advanced sound modeling to produce highly immersive and realistic audiobooks, enhancing listener engagement and accessibility.
Contribution
The framework integrates neural text-to-speech, spatial audio effects, and generative sound models to automate and improve immersive audiobook production.
Findings
Enhanced realism of 3D soundscapes using diffusion models and HOA.
Automatic synchronization of spatial effects with narrative flow.
Significant improvement in listener immersion and narrative authenticity.
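To make the HOA finding concrete, here is a minimal sketch of how a mono sound effect can be placed in a 3D sound field using first-order ambisonics, the simplest case that higher-order ambisonics generalizes. It is not the authors' implementation. The function name `encode_foa`, the AmbiX channel convention (ACN order W, Y, Z, X with SN3D normalization), and the angle convention (azimuth counterclockwise from straight ahead) are assumptions chosen for illustration.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into first-order ambisonics (assumed AmbiX: ACN order, SN3D).

    azimuth_deg: 0 is straight ahead, positive is to the listener's left.
    elevation_deg: 0 is ear level, positive is above.
    Returns the channels in ACN order: (W, Y, Z, X).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                  # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)    # left-right component
    z = sample * math.sin(el)                   # up-down component
    x = sample * math.cos(az) * math.cos(el)    # front-back component
    return (w, y, z, x)

# Hypothetical usage: a footstep sample placed directly to the listener's left.
left_footstep = encode_foa(1.0, azimuth_deg=90.0, elevation_deg=0.0)
```

Higher orders add more spherical-harmonic channels, which sharpen the spatial resolution at the cost of more audio channels per source.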
Abstract
This research introduces an innovative AI-driven multi-agent framework specifically designed for creating immersive audiobooks. Leveraging neural text-to-speech synthesis with FastSpeech 2 and VALL-E for expressive narration and character-specific voices, the framework employs advanced language models to automatically interpret textual narratives and generate realistic spatial audio effects. These sound effects are dynamically synchronized with the storyline through sophisticated temporal integration methods, including Dynamic Time Warping (DTW) and recurrent neural networks (RNNs). Diffusion-based generative models combined with higher-order ambisonics (HOA) and scattering delay networks (SDN) enable highly realistic 3D soundscapes, substantially enhancing listener immersion and narrative realism. This technology significantly advances audiobook applications, providing richer…
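The abstract names Dynamic Time Warping as one of the synchronization methods but does not describe how it is applied. As an illustration only, the sketch below implements textbook DTW on two one-dimensional feature sequences. It assumes, hypothetically, that narration and sound-effect cues have each been reduced to a per-frame scalar feature (for example, an energy envelope). The function name `dtw_path` and the absolute-difference cost are choices made for this example, not details taken from the paper.

```python
def dtw_path(a, b):
    """Align sequences a and b with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching a[i] to b[j], from the start of both sequences to the end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# Hypothetical usage: the cue track holds one value for an extra frame.
narration = [0, 1, 2, 3]
effects = [0, 1, 1, 2, 3]
total, alignment = dtw_path(narration, effects)
```

The returned path maps each narration frame to one or more effect frames, which is the information a scheduler would need to stretch or shift a cue so that it lands on the matching moment in the narration.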
Taxonomy
Topics: Music Technology and Sound Studies · Tactile and Sensory Interactions · Subtitles and Audiovisual Media
