TL;DR
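This paper adapts Granite-speech, a speech-aware LLM, for speaker-attributed ASR, in which relative speaker identity tags are written directly into the transcript. Jointly trained speaker cluster identification tags and a concatenation-based data augmentation method improve accuracy while requiring only minimal architectural changes.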
Contribution
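It introduces speaker cluster identification tags (e.g., [Speaker 1 cluster 42]:), trained jointly with the SAA task, and a data augmentation method based on artificially concatenated multi-speaker conversations. Together these allow a speech-aware LLM to be adapted for speaker-attributed ASR.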
Findings
Accuracy improves significantly over traditional pipelines.
Only minimal architectural modifications to Granite-speech are needed for SAA.
The approach is evaluated across multiple benchmarks.
Abstract
Speaker-Attributed Automatic Speech Recognition (SAA) enhances traditional ASR systems by incorporating relative speaker identity tags directly into the transcript (e.g., [Speaker 1]:, [Speaker 2]:). In this work, we extend the capabilities of Granite-speech, a state-of-the-art speech-aware Large Language Model (LLM) originally trained for transcription and translation. We demonstrate that it can be effectively adapted for SAA with only minimal architectural changes. Our core contribution is the introduction of speaker cluster identification tags (e.g., [Speaker 1 cluster 42]:) which are jointly trained with SAA to significantly improve accuracy. To address limitations in training data, we propose a data augmentation method that uses artificially concatenated multi-speaker conversations. Our approach is evaluated across multiple benchmarks and shows superior performance compared to…
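As a rough illustration of the tag format and the concatenation-based augmentation described in the abstract, the following Python sketch builds a synthetic multi-speaker example from single-speaker utterances. It is not the authors' implementation. The `Utterance` fields, the helper names, numbering speakers by order of first appearance, joining audio back to back without silence or overlap, and the lookup table of cluster ids are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Utterance:
    """One single-speaker segment (hypothetical structure, not from the paper)."""
    speaker_id: str        # global speaker identity in the source corpus
    text: str              # reference transcript of the segment
    samples: List[float]   # mono audio samples


def speaker_tag(relative_index: int, cluster_id: Optional[int] = None) -> str:
    """Format a relative speaker tag, optionally with a cluster id."""
    if cluster_id is None:
        return f"[Speaker {relative_index}]:"
    return f"[Speaker {relative_index} cluster {cluster_id}]:"


def concatenate_conversation(
    utterances: List[Utterance],
    clusters: Optional[Dict[str, int]] = None,
) -> Tuple[List[float], str]:
    """Join single-speaker utterances into one synthetic conversation.

    Relative speaker numbers are assigned in order of first appearance
    (an assumption). `clusters` maps a global speaker id to a cluster id;
    how such ids are obtained is outside this sketch.
    """
    relative: Dict[str, int] = {}
    samples: List[float] = []
    lines: List[str] = []
    for utt in utterances:
        if utt.speaker_id not in relative:
            relative[utt.speaker_id] = len(relative) + 1
        cluster_id = clusters.get(utt.speaker_id) if clusters else None
        tag = speaker_tag(relative[utt.speaker_id], cluster_id)
        lines.append(f"{tag} {utt.text}")
        samples.extend(utt.samples)  # back-to-back audio, no overlap
    return samples, "\n".join(lines)


utts = [
    Utterance("spk_a", "hello", [0.1, 0.2]),
    Utterance("spk_b", "hi there", [0.3]),
    Utterance("spk_a", "bye", [0.4]),
]
audio, transcript = concatenate_conversation(utts)
tagged_audio, tagged_transcript = concatenate_conversation(
    utts, clusters={"spk_a": 42, "spk_b": 7}
)
print(transcript)
print(tagged_transcript)
```

The same speaker keeps the same relative number throughout a conversation, so the third utterance above is tagged [Speaker 1] again. Passing a `clusters` table switches the output to the cluster-tagged form, such as [Speaker 1 cluster 42]:.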
Code & Models
- 🤗 ibm-granite/granite-speech-4.1-2b (model) · 374k downloads · ♡ 103
- 🤗 ibm-granite/granite-speech-4.1-2b-plus (model) · 17k downloads · ♡ 56
- 🤗 ibm-granite/granite-speech-4.1-2b-nar (model) · 6.8k downloads · ♡ 44
- 🤗 konszvi/granite-speech-4.1-2b-plus2 (model) · 235 downloads
- 🤗 valoomba/granite-speech-4.1-2b-plus-ONNX (model) · 72 downloads · ♡ 1
