Tap-to-Adapt: Learning User-Aligned Response Timing for Speech Agents

Zihong He; Hai-Ning Liang; Chen Liang

arXiv:2603.14449 · cs.HC · March 17, 2026

Open Access

TL;DR

This paper introduces Tap-to-Adapt, a framework in which users tap to activate or interrupt a speech agent, and those taps serve as online learning labels for the agent's response timing model, keeping its timing aligned with user intent.

Contribution

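The paper presents a tap-based online learning framework for response timing in speech agents, reports that a Dilated TCN and a sequential replay strategy play significant roles within it, and describes an evaluation and continuous data mining system used to collect approximately 20,000 samples from 20 participants.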
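
For readers unfamiliar with the Dilated TCN named above, the following is a minimal sketch of a dilated causal 1-D convolution and the receptive field of a stack of such layers. It is a generic illustration, not the authors' model: the kernel size, the dilation schedule, and the function names `dilated_causal_conv` and `receptive_field` are assumptions chosen for the example.

```python
from typing import List, Sequence


def dilated_causal_conv(x: Sequence[float], weights: Sequence[float], dilation: int) -> List[float]:
    """Causal 1-D convolution: output[t] depends only on x[t], x[t - d], x[t - 2d], ...

    Positions before the start of the sequence are treated as zero (left padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation  # look back j * dilation steps, never forward
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input steps visible to one output step of a stack with one conv per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical configuration: kernel size 3, dilations doubling per layer.
example_rf = receptive_field(3, [1, 2, 4, 8])
example_out = dilated_causal_conv([1, 2, 3, 4, 5], [1, 1], dilation=2)
```

Doubling the dilation at each layer lets the receptive field grow quickly with depth, which is the usual reason to choose dilated convolutions for streaming signals such as speech.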

Findings

01

Effective response timing models learned from user taps

02

Improved alignment with user intent demonstrated in experiments

03

Collected approximately 20,000 samples from user studies with 20 participants

Abstract

Response timing judgment is a critical component of interactive speech agents. Although there exists substantial prior work on turn modeling and voice wake-up, there is a lack of research on response timing judgments continuously aligned with user intent. To address this, we propose the Tap-to-Adapt framework, which enables users to naturally activate or interrupt the agent via tap interactions to construct online learning labels for response timing models. Under this framework, Dilated TCN and a sequential replay strategy play significant roles, as demonstrated through data-driven experiments and user studies. Additionally, we develop an evaluation and continuous data mining system tailored for the Tap-to-Adapt framework, through which we have collected approximately 20,000 samples from the user studies involving 20 participants.
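
The abstract's idea of turning taps into online learning labels can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's labeling scheme: it assumes audio is split into fixed-length frames, that an "activate" tap marks a frame where the agent should have responded (label 1), and that an "interrupt" tap marks a frame where it should not have been responding (label 0). The `Tap` type, `taps_to_labels`, and the 100 ms frame length are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tap:
    time_ms: int  # when the user tapped, in milliseconds from the start of the session
    kind: str     # "activate" or "interrupt" (assumed names for the two tap types)


def taps_to_labels(taps: List[Tap], n_frames: int, frame_ms: int = 100) -> List[Optional[int]]:
    """Map tap events onto per-frame labels; frames without a tap stay unlabeled (None)."""
    labels: List[Optional[int]] = [None] * n_frames
    for tap in taps:
        if tap.kind not in ("activate", "interrupt"):
            raise ValueError(f"unknown tap kind: {tap.kind!r}")
        idx = tap.time_ms // frame_ms  # integer frame index, avoids float rounding
        if 0 <= idx < n_frames:
            # Assumed convention: activate -> 1 (respond), interrupt -> 0 (do not respond)
            labels[idx] = 1 if tap.kind == "activate" else 0
    return labels


example_labels = taps_to_labels(
    [Tap(250, "activate"), Tap(730, "interrupt")],
    n_frames=10,
)
```

Only the tapped frames carry supervision here; how the actual system propagates a tap to neighboring frames, or how it replays past samples during online updates, is not specified on this page.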

Taxonomy

Topics: Speech and dialogue systems · Social Robot Interaction and HRI · Speech Recognition and Synthesis