In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

Xulin Fan; Vishal Sunder; Samuel Thomas; Mark Hasegawa-Johnson; Brian Kingsbury; George Saon

arXiv:2604.22817 · eess.AS · April 28, 2026

TL;DR

This paper presents a unified speech-aware language model that directly predicts word-level timestamps alongside transcripts, improving alignment accuracy and ASR performance.

Contribution

The paper introduces lightweight training strategies for joint timestamp and transcript prediction that improve alignment robustness without sacrificing recognition quality.

Findings

01 Improved timestamp accuracy across multiple datasets.

02 Enhanced overall ASR performance with the proposed strategies.

03 Efficient unified approach to speech recognition and timestamp prediction.

Abstract

Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.
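
To make "timestamp accuracy" concrete, here is a minimal sketch of one common way to score word-level timestamps: the mean absolute error of word start and end times against a reference alignment. This is an illustration only. The abstract does not say which metric the paper uses, and the names `TimedWord` and `mean_boundary_error` are hypothetical, not taken from the authors' code. The sketch also assumes the predicted and reference transcripts contain the same words in the same order.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TimedWord:
    """A word with start and end times in seconds (hypothetical container)."""
    word: str
    start: float
    end: float


def mean_boundary_error(pred: Sequence[TimedWord], ref: Sequence[TimedWord]) -> float:
    """Mean absolute start/end boundary error in seconds.

    Assumption: pred and ref are already word-matched one-to-one. Handling
    recognition errors (insertions, deletions) would require an alignment
    step that this sketch leaves out.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must contain the same number of words")
    if not ref:
        return 0.0
    total = 0.0
    for p, r in zip(pred, ref):
        if p.word != r.word:
            raise ValueError(f"word mismatch: {p.word!r} vs {r.word!r}")
        # Each word contributes two boundaries: its start and its end.
        total += abs(p.start - r.start) + abs(p.end - r.end)
    return total / (2 * len(ref))


reference = [TimedWord("hello", 0.0, 0.5), TimedWord("world", 0.5, 1.0)]
predicted = [TimedWord("hello", 0.25, 0.5), TimedWord("world", 0.5, 1.25)]
error_seconds = mean_boundary_error(predicted, reference)
```

In this toy case two of the four boundaries are off by 0.25 s, so the mean error is 0.125 s. A lower value means the predicted word boundaries sit closer to the reference alignment.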
