EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios
Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

TL;DR
EFFUSE is a lightweight feature fusion method in which a single SSL model predicts the features of several other SSL models, improving multilingual speech recognition over individual SSL models while using fewer parameters than fusing the full models.
Contribution
The paper presents EFFUSE, a single-model approach that approximates the fused features of multiple SSL models through prediction, avoiding the parameter cost of running every SSL model.
Findings
Outperforms individual SSL models in multilingual speech recognition.
The best model achieves an average SUPERB score increase of 63.5 (6.3%) over the SSL baselines on ML-SUPERB.
Reduces parameter size by 49% on average compared to the fusion models.
Abstract
Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models could achieve superior performance compared to using one SSL model. However, fusing models increases the overall parameter size, leading to higher computational costs. We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. Our experiments show that EFFUSE outperforms individual SSL models in multilingual speech recognition tasks. Our best performing model achieves an average SUPERB score increase of 63.5 (6.3%) from the SSL baselines on the Multilingual Speech Universal PERformance Benchmark (ML-SUPERB), while decreasing parameter size on average by…
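The idea in the abstract, one SSL model predicting the features of another so that only the first needs to run at inference time, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the feature matrices are synthetic, the predictor is a closed-form ridge regression standing in for whatever trained prediction module the paper uses, and fusion is done by plain concatenation. The names `fit_linear_predictor` and `fuse` are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_linear_predictor(src_feats, tgt_feats, ridge=1e-3):
    """Ridge-regression map from source features (T, d_src) to target features (T, d_tgt).

    Assumption: a linear map stands in for the paper's prediction module.
    """
    d_src = src_feats.shape[1]
    gram = src_feats.T @ src_feats + ridge * np.eye(d_src)
    return np.linalg.solve(gram, src_feats.T @ tgt_feats)

def fuse(src_feats, predictors):
    """Concatenate the source features with each predicted feature stream.

    Assumption: concatenation is used here only as the simplest fusion choice.
    """
    predicted = [src_feats @ w for w in predictors]
    return np.concatenate([src_feats] + predicted, axis=-1)

# Synthetic stand-ins for frame-level features from two SSL models.
rng = np.random.default_rng(0)
T, d_src, d_tgt = 200, 16, 12
src = rng.standard_normal((T, d_src))
true_map = rng.standard_normal((d_src, d_tgt))
tgt = src @ true_map + 0.01 * rng.standard_normal((T, d_tgt))

# Fit the predictor once; afterwards only the source features are needed.
w = fit_linear_predictor(src, tgt)
fused = fuse(src, [w])
mse = float(np.mean((src @ w - tgt) ** 2))
```

In this toy setup the second model is only needed while fitting the predictor; downstream code consumes `fused`, which is computed from the source features alone. That is the source of the parameter saving the abstract describes, under the assumptions stated above.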
Taxonomy
Topics: Advanced Chemical Sensor Technologies · Fault Detection and Control Systems · Speech Recognition and Synthesis
