Exploration on HuBERT with Multiple Resolutions

Jiatong Shi; Yun Tang; Hirofumi Inaguma; Hongyu Gong; Juan Pino; Shinji Watanabe

arXiv:2306.01084 · cs.SD · June 26, 2023 · 1 citation

Open Access

TL;DR

This paper investigates enhancing HuBERT, a self-supervised speech model, by integrating multiple temporal resolutions of its hidden representations to better capture diverse speech attributes for various tasks.

Contribution

It introduces a novel approach to incorporate multi-resolution features into HuBERT, improving its performance over the standard fixed-resolution model.

Findings

01 Multi-resolution HuBERT outperforms the original model.

02 Parallel and hierarchical methods effectively integrate different resolutions.

03 Multi-resolution approach captures diverse speech information.

Abstract

Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL) model in speech processing. However, we argue that its fixed 20ms resolution for hidden representations would not be optimal for various speech-processing tasks since their attributes (e.g., speaker characteristics and semantics) are based on different time scales. To address this limitation, we propose utilizing HuBERT representations at multiple resolutions for downstream tasks. We explore two approaches, namely the parallel and hierarchical approaches, for integrating HuBERT features with different resolutions. Through experiments, we demonstrate that HuBERT with multiple resolutions outperforms the original model. This highlights the potential of utilizing multiple resolutions in SSL models like HuBERT to capture diverse information from speech signals.
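
The abstract names the parallel approach but does not spell out how features at different resolutions are combined. As a rough illustration only, the sketch below assumes one plausible scheme: a coarser feature stream (for example 40 ms frames) is upsampled by frame repetition to align with the 20 ms stream, and the two are concatenated along the feature dimension. The function names `num_frames`, `upsample_by_repetition`, and `parallel_fuse` are hypothetical and are not taken from the paper or from any library.

```python
from typing import List

Frame = List[float]  # one feature vector for one time step


def num_frames(duration_s: float, resolution_ms: int) -> int:
    """Approximate number of frames for a clip at a given frame resolution."""
    return int(duration_s * 1000 // resolution_ms)


def upsample_by_repetition(frames: List[Frame], factor: int) -> List[Frame]:
    """Repeat every frame `factor` times so a coarse stream aligns with a finer one."""
    return [list(frame) for frame in frames for _ in range(factor)]


def parallel_fuse(fine: List[Frame], coarse: List[Frame], factor: int) -> List[Frame]:
    """Assumed fusion rule: align the coarse stream, then concatenate per frame."""
    aligned = upsample_by_repetition(coarse, factor)
    length = min(len(fine), len(aligned))  # guard against off-by-one frame counts
    return [fine[t] + aligned[t] for t in range(length)]


# Toy example: four 20 ms frames and two 40 ms frames, each of dimension 2.
fine_feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
coarse_feats = [[10.0, 20.0], [30.0, 40.0]]
fused_feats = parallel_fuse(fine_feats, coarse_feats, factor=2)  # 4 frames, dimension 4
```

Under this assumption, a one-second clip yields 50 frames at 20 ms and 25 frames at 40 ms, so each coarse frame is paired with two fine frames. Other fusion rules, such as a learned weighted sum, would fit the same description.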

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing