Large-Scale Automatic Audiobook Creation

Brendan Walsh; Mark Hamilton; Greg Newby; Xi Wang; Serena Ruan; Sheng Zhao; Lei He; Shaofei Zhang; Eric Dettinger; William T. Freeman; Markus Weimer

arXiv:2309.03926 · cs.SD · September 11, 2023 · 1 citation

TL;DR

This paper presents an automated system that leverages neural text-to-speech technology to efficiently produce high-quality, customizable audiobooks from online e-books, significantly reducing manual effort and enabling large-scale audiobook creation.

Contribution

It introduces a scalable, automated pipeline for generating open-license audiobooks from diverse e-books, including features for customization and voice matching, with over five thousand audiobooks released.

Findings

01 Created over 5,000 open-license audiobooks

02 Enabled user customization of voice and style

03 Operated on hundreds of books in parallel

Abstract

An audiobook can dramatically improve a work of literature's accessibility and reader engagement. However, audiobooks can take hundreds of hours of human effort to create, edit, and publish. In this work, we present a system that can automatically generate high-quality audiobooks from online e-books. In particular, we leverage recent advances in neural text-to-speech to create and release thousands of human-quality, open-license audiobooks from the Project Gutenberg e-book collection. Our method can identify the proper subset of e-book content to read for a wide collection of diversely structured books and can operate on hundreds of books in parallel. Our system allows users to customize an audiobook's speaking speed, style, and emotional intonation, and can even match a desired voice using a small amount of sample audio. This work contributed over five thousand open-license…
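
The abstract names three mechanical steps: choosing which parts of an e-book to read, setting the speaking speed, and processing many books in parallel. The sketch below illustrates one way these steps could fit together. It is not the authors' pipeline. The skip list `SKIP_TITLES`, the function names, and the `(title, text)` section format are all hypothetical, and the use of the standard SSML `<prosody rate>` attribute for speed control is an assumption, since the abstract does not say how speed is set. No speech is synthesized here; the output is SSML text that a TTS engine accepting SSML could consume.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape

# Hypothetical section titles that should not be read aloud.
SKIP_TITLES = {"contents", "index", "transcriber's note", "license"}


def select_readable(sections):
    """Keep (title, text) pairs whose title is not in SKIP_TITLES."""
    return [(t, x) for t, x in sections if t.strip().lower() not in SKIP_TITLES]


def to_ssml(text, rate="medium"):
    """Wrap text in an SSML <prosody> element that sets the speaking rate."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<prosody rate="{rate}">{escape(text)}</prosody></speak>'
    )


def prepare_book(sections, rate="medium"):
    """Return one SSML document per readable section of a single book."""
    return [to_ssml(text, rate) for _, text in select_readable(sections)]


def prepare_library(books, rate="medium", workers=8):
    """Prepare many books concurrently; books maps book id -> sections."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda s: prepare_book(s, rate), books.values())
        return dict(zip(books.keys(), results))


if __name__ == "__main__":
    demo = {"b1": [("Contents", "I. II."), ("Chapter 1", "Tom & Jerry")]}
    for book_id, docs in prepare_library(demo, rate="slow").items():
        print(book_id, docs)
```

A title-based skip list is the simplest possible filter and would miss many of the diversely structured books the abstract mentions; it stands in here only to show where content selection sits relative to synthesis and parallel dispatch.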

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Music Technology and Sound Studies