MaLa-ASR: Multimedia-Assisted LLM-Based ASR

Guanrou Yang; Ziyang Ma; Fan Yu; Zhifu Gao; Shiliang Zhang; Xie Chen

arXiv:2406.05839 · eess.AS · November 12, 2024

TL;DR

MaLa-ASR is an LLM-based automatic speech recognition model that uses multimedia auxiliary information, such as keywords extracted from presentation slides, to improve recognition accuracy on conference speech, with relative WER reductions of 27.9% and 44.7% over the SlideSpeech baseline.

Contribution

This paper introduces MaLa-ASR, the first LLM-based ASR model that effectively integrates textual auxiliary data to enhance speech recognition performance.

Findings

01. Achieves average WERs of 9.4% and 11.7% on the L95 and S95 subsets of SlideSpeech.

02. Reduces biased word error rate (B-WER) by over 44%.

03. Sets new state-of-the-art results on the SlideSpeech dataset.
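
The findings above are stated as word error rates and relative reductions. As a minimal sketch of how such figures are conventionally computed (this is the standard word-level edit-distance definition, not code from the paper), the snippet below defines WER and a relative-reduction helper. The function names and the toy inputs are assumptions chosen for illustration; they are not SlideSpeech data.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate in percent: edit distance / reference length."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative drop in percent from a baseline error rate to an improved one."""
    return 100.0 * (baseline - improved) / baseline


# Toy values for illustration only (not results from the paper).
print(wer("a b c d", "a x c"))         # one substitution + one deletion -> 50.0
print(relative_reduction(20.0, 15.0))  # -> 25.0
```

A relative reduction is measured against the baseline's own error rate, which is why a relative drop of several tens of percent can correspond to an absolute change of only a few WER points.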

Abstract

As more information-rich data such as video becomes available, using multi-modal auxiliary information to enhance audio tasks has attracted widespread research interest. The recent surge in research on LLM-based audio models offers fresh perspectives for tackling audio tasks. Given that LLMs can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing relative WER reductions of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores LLMs' strong performance in speech tasks and their capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word…
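
The abstract describes adding slide keywords to the LLM's input prompt. The sketch below shows one plausible way to assemble such a prompt. The function name `build_prompt`, the `max_keywords` cap, and the prompt wording are all hypothetical; the page does not give the template the authors actually use, so this should be read as an illustration of the idea rather than the MaLa-ASR implementation.

```python
def build_prompt(keywords, max_keywords=50):
    """Build a text prompt that optionally lists slide keywords.

    Hypothetical template for illustration; not the paper's actual prompt.
    """
    # Deduplicate case-insensitively while preserving order; drop blanks.
    seen = set()
    cleaned = []
    for kw in keywords:
        kw = kw.strip()
        if kw and kw.lower() not in seen:
            seen.add(kw.lower())
            cleaned.append(kw)
    cleaned = cleaned[:max_keywords]

    instruction = "Transcribe the speech to text."
    if not cleaned:
        # No auxiliary information: fall back to the plain ASR instruction.
        return instruction
    return (
        instruction
        + " The following keywords from the slides may appear in the speech;"
        + " ignore any that are irrelevant: "
        + ", ".join(cleaned)
        + "."
    )


# Example with made-up keywords.
print(build_prompt(["Whisper", "whisper", " LoRA "]))
```

In an LLM-based ASR setup, a text prompt like this would typically be combined with the encoded speech features before being passed to the LLM; how that combination is done is outside what this page specifies.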

Code & Models

Repositories

X-LANCE/SLAM-LLM (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques