TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Jeff Hwang; Moto Hira; Caroline Chen; Xiaohui Zhang; Zhaoheng Ni; Guangzhi Sun; Pingchuan Ma; Ruizhe Huang; Vineel Pratap; Yuekai Zhang; Anurag Kumar; Chin-Yun Yu; Chuang Zhu; Chunxi Liu; Jacob Kahn; Mirco Ravanelli; Peng Sun; Shinji Watanabe; Yangyang Shi; Yumeng Tao; Robin Scheibler; Samuele Cornell; Sean Kim; Stavros Petridis

arXiv:2310.17864·eess.AS·October 30, 2023·1 cites

Open Access · 1 Repo

TL;DR

TorchAudio 2.1 extends PyTorch's audio and speech processing capabilities with self-supervised pre-trained pipelines and training recipes, high-performance CTC decoders, and advanced media I/O tools, and supports a selection of these features with empirical studies.

Contribution

Introduces TorchAudio 2.1 with new self-supervised pipelines, high-performance decoders, and advanced audio tools, improving ease of use and performance for speech and audio research.

Findings

01 Self-supervised learning pipelines achieve competitive results.

02 High-performance CTC decoders improve speech recognition accuracy.

03 Empirical studies demonstrate state-of-the-art performance of new features.

Abstract

TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that…
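
To make the CTC decoding step mentioned in the abstract concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is an illustration only and does not use TorchAudio: the function name `greedy_ctc_decode`, the toy label set, and the choice of index 0 for the blank symbol are all assumptions made for this example. The decoders the abstract refers to are described as high-performance, and this sketch makes no attempt to reproduce how they work.

```python
from itertools import groupby
from typing import Sequence


def greedy_ctc_decode(
    emissions: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,  # assumed index of the CTC blank symbol
) -> str:
    """Pick the best label per frame, merge repeats, then drop blanks."""
    # Index of the highest-scoring label in each time frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in emissions]
    # Collapse runs of identical consecutive indices into one.
    collapsed = [index for index, _ in groupby(best_path)]
    # Remove blank symbols and map the remaining indices to labels.
    return "".join(labels[index] for index in collapsed if index != blank)


# Hypothetical toy input: 6 frames over the labels "-", "a", "c", "t".
labels = ["-", "a", "c", "t"]
emissions = [
    [0.1, 0.1, 0.7, 0.1],    # c
    [0.1, 0.1, 0.7, 0.1],    # c (repeat, merged)
    [0.6, 0.2, 0.1, 0.1],    # blank
    [0.1, 0.8, 0.05, 0.05],  # a
    [0.1, 0.1, 0.1, 0.7],    # t
    [0.1, 0.1, 0.1, 0.7],    # t (repeat, merged)
]
print(greedy_ctc_decode(emissions, labels))  # prints "cat"
```

The blank symbol is what lets CTC represent genuinely doubled characters: two identical labels separated by a blank survive the merge step, while two adjacent identical labels collapse into one.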

Code & Models

Repositories

pytorch/audio
pytorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Lib