Tap-to-Adapt: Learning User-Aligned Response Timing for Speech Agents
Zihong He, Hai-Ning Liang, Chen Liang

TL;DR
This paper introduces Tap-to-Adapt, a framework in which users tap to activate or interrupt a speech agent; those taps become online learning labels for a response timing model, keeping the agent's timing aligned with user intent.
Contribution
It presents a tap-based online learning framework for response timing in speech agents, built around a Dilated TCN and a sequential replay strategy, together with an evaluation and continuous data mining system used to collect approximately 20,000 samples from 20 participants.
Findings
Effective response timing models learned from user taps
Improved alignment with user intent demonstrated in experiments
Approximately 20,000 samples collected from user studies with 20 participants
Abstract
Response timing judgment is a critical component of interactive speech agents. Although there exists substantial prior work on turn modeling and voice wake-up, there is a lack of research on response timing judgments continuously aligned with user intent. To address this, we propose the Tap-to-Adapt framework, which enables users to naturally activate or interrupt the agent via tap interactions to construct online learning labels for response timing models. Under this framework, Dilated TCN and a sequential replay strategy play significant roles, as demonstrated through data-driven experiments and user studies. Additionally, we develop an evaluation and continuous data mining system tailored for the Tap-to-Adapt framework, through which we have collected approximately 20,000 samples from the user studies involving 20 participants.
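The abstract names a Dilated TCN but this page gives no architectural details. As general background only, the sketch below shows what a causal dilated 1-D convolution computes and how stacking dilations widens the span of past audio frames a timing decision can draw on. The function names, kernel size, and dilation schedule are illustrative assumptions, not the authors' configuration.

```python
def receptive_field(kernel_size, dilations):
    """Frames of history visible to one output frame in a stack of
    causal dilated convolutions (one layer per entry in `dilations`)."""
    return 1 + (kernel_size - 1) * sum(dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal dilated convolution over a list of floats.

    The output at time t combines x[t], x[t - d], x[t - 2d], ...
    Frames before the start of the sequence are treated as zero,
    so no output ever depends on future input.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Hypothetical schedule: kernel size 3, dilations doubling per layer.
frames_seen = receptive_field(3, [1, 2, 4, 8])
smoothed = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)
```

Because each output uses only current and past frames, a layer of this kind can run on streaming audio, which is the setting a response timing model operates in.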
Taxonomy
Topics: Speech and dialogue systems · Social Robot Interaction and HRI · Speech Recognition and Synthesis
