Synaspot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

Kewei Li; Yinan Zhong; Xiaotao Liang; Tianchi Dai; Shaofei Xue

arXiv:2512.15124·cs.SD·December 18, 2025

Open Access

TL;DR

This paper introduces Synaspot, a lightweight streaming multi-modal framework for keyword spotting that effectively fuses audio and text features, reducing parameters and improving performance in continuous speech streams.

Contribution

It presents a novel multimodal framework that reduces speaker-specific information, efficiently fuses speech and text, and enables streaming decoding with fewer parameters.

Findings

01 Outperforms existing streaming methods in accuracy

02 Uses significantly fewer parameters

03 Effective fusion of audio and text modalities

Abstract

Open-vocabulary keyword spotting (KWS) in continuous speech streams holds significant practical value across a wide range of real-world applications. Increasing attention has been paid to the role of different modalities in KWS, and their effectiveness has been acknowledged. However, the increased parameter cost of multimodal integration and the constraints of end-to-end deployment have limited the practical applicability of such models. To address these challenges, we propose a lightweight, streaming multi-modal framework. First, we focus on multimodal enrollment features and reduce speaker-specific (voiceprint) information in the speech enrollment to extract speaker-irrelevant characteristics. Second, we effectively fuse speech and text features. Finally, we introduce a streaming decoding framework that only requires the encoder to extract features, which are then mathematically…
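
The abstract is truncated before it says how the encoder features are decoded, so the following is only a generic illustration of encoder-only streaming keyword scoring, not the paper's method. It assumes, hypothetically, that the audio and text enrollment embeddings are fused by a weighted average and that each incoming frame embedding is scored by cosine similarity over a short sliding window. All function names, the fusion rule, the window size, and the threshold are assumptions made for this sketch.

```python
import math
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def fuse_enrollment(audio_emb: Sequence[float],
                    text_emb: Sequence[float],
                    audio_weight: float = 0.5) -> List[float]:
    """Hypothetical fusion: weighted average of audio and text enrollment embeddings."""
    return [audio_weight * a + (1.0 - audio_weight) * t
            for a, t in zip(audio_emb, text_emb)]


def stream_scores(frames: Iterable[Sequence[float]],
                  keyword_emb: Sequence[float],
                  window: int = 3) -> Iterator[float]:
    """Yield one score per incoming frame from the mean of the last `window` frames."""
    buf: deque = deque(maxlen=window)  # old frames fall out automatically
    for frame in frames:
        buf.append(frame)
        mean = [sum(f[i] for f in buf) / len(buf) for i in range(len(frame))]
        yield cosine(mean, keyword_emb)


def detect(frames: Iterable[Sequence[float]],
           keyword_emb: Sequence[float],
           window: int = 3,
           threshold: float = 0.9) -> List[int]:
    """Return the indices of frames whose windowed score reaches the threshold."""
    return [i for i, score in enumerate(stream_scores(frames, keyword_emb, window))
            if score >= threshold]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    keyword = fuse_enrollment([1.0, 0.0], [1.0, 0.0])
    frames = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(detect(frames, keyword, window=2, threshold=0.9))
```

Because the scoring loop keeps only a fixed-size buffer, memory use stays constant however long the stream runs, which is the property that makes this style of decoding suitable for continuous audio.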

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems