Too Slow to Be Useful? On Incorporating Humans in the Loop of Smart Speakers
Shih-Hong Huang, Chieh-Yang Huang, Yuxin Deng, Hua Shen, Szu-Chi Kuan, Ting-Hao 'Kenneth' Huang

TL;DR
This paper investigates the latency challenges of integrating human workers into real-time smart speaker systems, demonstrating that human-in-the-loop approaches face significant delays that hinder practical deployment.
Contribution
It quantifies the latency issues of human-in-the-loop systems in voice interactions and highlights their limitations for time-sensitive applications.
Findings
Human-in-the-loop systems respond more slowly than the few seconds a smart speaker allows, which makes them unsuitable for real-time responses.
Participants experienced delays that affected conversation quality.
The study provides empirical data on the bottlenecks of human-powered voice systems.
Abstract
Real-time crowd-powered systems, such as Chorus/Evorus, VizWiz, and Apparition, have shown how incorporating humans into automated systems could supplement where the automatic solutions fall short. However, one unspoken bottleneck of applying such architectures to more scenarios is the longer latency of including humans in the loop of automated systems. For the applications that have hard constraints in turnaround times, human-operated components' longer latency and large speed variation seem to be apparent deal breakers. This paper explicates and quantifies these limitations by using a human-powered text-based backend to hold conversations with users through a voice-only smart speaker. Smart speakers must respond to users' requests within seconds, so the workers behind the scenes only have a few seconds to compose answers. We measured the end-to-end system latency and the conversation…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Speech and dialogue systems · Context-Aware Activity Recognition Systems
