Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

David Joohun Kim; Daniyal Anjum; Bonny Banerjee; Omar Abbasi

arXiv:2604.08412·cs.SD·April 10, 2026

TL;DR

This paper introduces SAS, an on-device system that detects device-addressed speech in multi-speaker environments by treating routing as a sequential problem over interaction history. It reaches F1=0.86 with audio only and F1=0.95 with audio+video, at under 150 ms latency.

Contribution

The paper formalizes Sequential Device-Addressed Routing (SDAR) and presents SAS, an on-device implementation that uses causal interaction history to decide, before transcription, whether audio should be forwarded.

Findings

01

SAS achieves F1=0.86 with audio only and F1=0.95 with audio+video on a held-out 60-hour multi-speaker English test set.

02

Removing interaction history (Stage 3) reduces F1 from 0.95 to 0.57 ± 0.03.

03

SAS runs fully on-device with <150 ms latency and <20 MB footprint.

Abstract

We study device-addressed speech detection under pre-ASR edge deployment constraints, where systems must decide whether to forward audio before transcription under strict latency and compute limits. We show that, in multi-speaker environments with temporally ambiguous utterances, this task is more effectively modelled as a sequential routing problem over interaction history than as an utterance-local classification task. We formalize this as Sequential Device-Addressed Routing (SDAR) and present the Selective Attention System (SAS), an on-device implementation that instantiates this formulation. On a held-out 60-hour multi-speaker English test set, the primary audio-only configuration achieves F1=0.86 (precision=0.89, recall=0.83); with an optional camera, audio+video fusion raises F1 to 0.95 (precision=0.97, recall=0.93). Removing causal interaction history (Stage 3) reduced F1 from…
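The F1 scores quoted in the abstract can be cross-checked against the stated precision and recall using the standard definition of F1 as their harmonic mean. The following is a minimal sketch for that arithmetic only; the function name `f1_score` and the `reported` table are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract.
# The dictionary layout itself is an illustrative assumption.
reported = {
    "audio-only": (0.89, 0.83, 0.86),
    "audio+video": (0.97, 0.93, 0.95),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    # Compare at the two-decimal precision used in the abstract.
    consistent = round(f1, 2) == f1_reported
    print(f"{name}: computed F1={f1:.3f}, reported F1={f1_reported:.2f}, consistent={consistent}")
```

Both configurations round to the reported values, so the precision, recall, and F1 figures in the abstract are mutually consistent at two decimal places.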
