Optimizing Speech Multi-View Feature Fusion through Conditional Computation

Weiqiao Shan; Yuhao Zhang; Yuchen Han; Bei Li; Xiaofeng Zhao; Yuang Li; Min Zhang; Hao Yang; Tong Xiao; Jingbo Zhu

arXiv:2501.08057·eess.AS·January 15, 2025

TL;DR

This paper introduces a generalized feature fusion framework using conditional computation to effectively combine SSL and spectral speech features, improving convergence speed and robustness in speech translation tasks.

Contribution

It proposes a novel gradient-sensitive gating network with multi-stage dropout for conflict mitigation in multi-view feature fusion.
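To make the idea of gated fusion concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `gated_fusion`, the scalar sigmoid gate, and the single-step view-level dropout are all assumptions chosen for illustration. In particular, the gate below is an ordinary learned gate rather than the gradient-sensitive one the paper describes, and the dropout is a simple stand-in for the multi-stage strategy.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ssl_vec, fbank_vec, gate_w, gate_b,
                 p_drop=0.0, training=False, rng=random):
    """Fuse two same-length feature views with a scalar gate (hypothetical).

    ssl_vec, fbank_vec: feature vectors for one frame, already projected
                        to the same dimension (an assumption of this sketch).
    gate_w, gate_b:     gate parameters over the concatenated views.
    p_drop:             probability of dropping one whole view in training.
    """
    if len(ssl_vec) != len(fbank_vec):
        raise ValueError("both views must have the same dimension")

    # View-level dropout (illustrative stand-in): occasionally pass only
    # one view through, so the model cannot rely on a single input.
    if training and rng.random() < p_drop:
        return list(ssl_vec) if rng.random() < 0.5 else list(fbank_vec)

    # Scalar gate computed from both views.
    concat = list(ssl_vec) + list(fbank_vec)
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)

    # Convex combination: g weights the SSL view, (1 - g) the FBank view.
    return [g * s + (1.0 - g) * f for s, f in zip(ssl_vec, fbank_vec)]
```

With all gate weights and the bias set to zero the gate is exactly 0.5, so the output is the element-wise average of the two views; training would move the gate toward whichever view is more useful.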

Findings

01 Accelerates model convergence with combined features

02 Maintains competitive performance across speech translation tasks

03 Enhances robustness to multi-view feature conflicts

Abstract

Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
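The abstract speaks of two feature views conflicting "in terms of update directions" without saying how that is measured. One common proxy, assumed here and not taken from the paper, is the cosine similarity between the gradients each view induces on shared parameters: a negative value means the two updates pull in opposing directions. The helper names `cosine` and `views_conflict` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero gradient has no direction; treat it as neutral.
        return 0.0
    return dot / (norm_u * norm_v)


def views_conflict(grad_ssl, grad_fbank):
    """Flag a conflict when the two update directions oppose each other.

    The threshold of 0 is an assumption of this sketch, not a value
    reported by the paper.
    """
    return cosine(grad_ssl, grad_fbank) < 0.0
```

Under this proxy, orthogonal gradients count as non-conflicting, and only gradients with a negative inner product are flagged.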

Code & Models

Repositories

shanweiqiao/gsgn (Official)

Taxonomy

Topics: Speech and Audio Processing

Methods: Dropout