Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Shashi Kumar; Iuliia Thorbecke; Sergio Burdisso; Esaú Villatoro-Tello; Manjunath K E; Kadri Hacioğlu; Pradeep Rangappa; Petr Motlicek; Aravind Ganapathiraju; Andreas Stolcke

arXiv:2411.03866 · cs.CL · January 23, 2025

Open Access

TL;DR

This paper evaluates SLAM-ASR, a simple yet promising speech recognition approach, revealing its limitations in cross-domain robustness and under speech perturbations, and providing insights for improving LLM-based ASR systems.

Contribution

The paper provides a comprehensive empirical analysis of SLAM-ASR, highlighting its weaknesses and offering guidance for enhancing robustness across diverse speech scenarios.

Findings

01 SLAM-ASR performs poorly in cross-domain evaluations.

02 Speech perturbations on in-domain data, such as additive noise and changes in speech rate, degrade performance.

03 The analysis yields insights for fine-tuning and configuring LLM-based ASR models more effectively.
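
The two perturbations named in the findings, additive noise and speech-rate change, can be illustrated with a minimal sketch. This is not the paper's procedure: the function names, the white Gaussian noise model, and the linear-interpolation resampling are all assumptions chosen to keep the example dependency-free.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio (dB).

    Hypothetical helper: the noise type and scaling are illustrative assumptions.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Rescale using the measured noise power so the target SNR is met
    # for this particular noise draw, not just in expectation.
    measured_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(target_noise_power / measured_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 gives faster, shorter audio.

    Hypothetical helper: this naive approach also shifts pitch, which
    dedicated tempo-change tools avoid.
    """
    out_len = max(1, int(round(len(signal) / rate)))
    out = []
    for i in range(out_len):
        pos = i * rate
        lo = min(int(pos), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out
```

A robustness check along these lines would transcribe the original and the perturbed audio with the same model and compare word error rates.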

Abstract

Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can…
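
To make the "linear connector" idea in the abstract concrete, the following is a rough sketch of how encoder frames might be mapped into an LLM's embedding space. The abstract does not give the details, so everything here is an assumption for illustration: the frame-stacking step, the toy dimensions, and the names `stack_frames`, `init_weight`, and `linear_connector` are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive encoder frames to shorten the sequence.

    Assumed downsampling step; leftover frames at the tail are dropped.
    """
    usable = len(frames) - len(frames) % k
    stacked = []
    for start in range(0, usable, k):
        merged = []
        for frame in frames[start:start + k]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def init_weight(out_dim, in_dim, seed=0):
    """Random (out_dim x in_dim) matrix; stands in for trained parameters."""
    rng = random.Random(seed)
    bound = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-bound, bound) for _ in range(in_dim)]
            for _ in range(out_dim)]


def linear_connector(frames, weight, bias):
    """Apply y = W x + b to each frame, yielding LLM-sized embeddings."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]


# Toy sizes (assumptions): 10 frames of encoder dim 4, stack 3, LLM dim 8.
encoder_out = [[float(i)] * 4 for i in range(10)]
stacked = stack_frames(encoder_out, 3)       # 3 frames of dim 12
weight = init_weight(8, 12)
speech_embeddings = linear_connector(stacked, weight, [0.0] * 8)
```

In a setup of this kind, only the connector's weights would be trained while the encoder and the LLM stay fixed, and the projected speech embeddings would be placed alongside the text prompt embeddings as LLM input.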
