TL;DR
This paper presents a unified speech-aware language model that directly predicts word-level timestamps alongside transcripts, improving alignment accuracy and ASR performance.
Contribution
It introduces lightweight training strategies for joint timestamp and transcript prediction, enhancing robustness without sacrificing recognition quality.
Findings
Improved timestamp accuracy across multiple datasets.
Enhanced overall ASR performance with the proposed strategies.
Efficient unified approach to speech recognition and timestamp prediction.
Abstract
Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.
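To make "timestamp accuracy" concrete, here is a minimal sketch of one way word-level timestamps could be compared against a reference alignment. It is an illustration under stated assumptions, not the paper's evaluation protocol: the `TimedWord` record, the `mean_boundary_error` function, and the choice of mean absolute boundary error as the metric are all hypothetical, and the summary above does not specify the model's actual output format or metric.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TimedWord:
    """Hypothetical record for one transcribed word with its time span."""
    word: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def mean_boundary_error(pred: Sequence[TimedWord], ref: Sequence[TimedWord]) -> float:
    """Mean absolute error (seconds) over all word start and end boundaries.

    Assumes both sequences contain the same words in the same order;
    handling recognition errors would need a word-level alignment step first.
    """
    if len(pred) != len(ref):
        raise ValueError("prediction and reference must have the same length")
    if not pred:
        return 0.0
    total = 0.0
    for p, r in zip(pred, ref):
        if p.word != r.word:
            raise ValueError(f"word mismatch: {p.word!r} vs {r.word!r}")
        total += abs(p.start - r.start) + abs(p.end - r.end)
    # Two boundaries (start and end) per word.
    return total / (2 * len(pred))


# Made-up example values, for illustration only.
predicted = [TimedWord("hello", 0.0, 0.5), TimedWord("world", 0.5, 1.0)]
reference = [TimedWord("hello", 0.0, 0.5), TimedWord("world", 0.75, 1.0)]
error = mean_boundary_error(predicted, reference)
```

Here only the start of "world" differs, by 0.25 s, so the error averaged over the four boundaries is 0.0625 s.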
Code & Models
- 🤗 ibm-granite/granite-speech-4.1-2b (model) · 326k downloads · ♡ 103
- 🤗 ibm-granite/granite-speech-4.1-2b-plus (model) · 17k downloads · ♡ 55
- 🤗 ibm-granite/granite-speech-4.1-2b-nar (model) · 6.0k downloads · ♡ 44
- 🤗 valoomba/granite-speech-4.1-2b-plus-ONNX (model) · 72 downloads · ♡ 1
- 🤗 rikhoffbauer2/lyric-sync (model)
