Human Latency Conversational Turns for Spoken Avatar Systems

Derek Jacoby; Tianyi Zhang; Aanchan Mohan; Yvonne Coady

arXiv:2404.16053 · cs.HC · April 26, 2024 · 5 cites


TL;DR

This paper explores methods for enabling spoken avatar systems to generate responses in near real-time by predicting missing parts of utterances, aiming to match human conversational latencies.

Contribution

The paper introduces techniques for understanding incomplete utterances and generating responses quickly. It uses GPT-4 to fill in missing context and proposes a classifier to detect semantic completeness.

Findings

01

GPT-4 can fill in the context missing from a question whose final word was dropped over 60% of the time

02

A simple classifier can determine if an utterance is complete or needs filler

03

Methods enable near real-time responses matching human dialogue timing
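To make the second finding concrete, here is a minimal sketch of a completeness check on a truncated utterance. It is not the classifier proposed in the paper, whose design is not described in this summary. It is a hypothetical rule-based stand-in: the `DANGLING_WORDS` set and the `needs_filler` function are assumptions made purely for illustration.

```python
# Hypothetical heuristic, NOT the paper's classifier: treat an utterance prefix
# as incomplete when it ends on a function word that normally requires a
# following content word (e.g. "who wrote the declaration of").
DANGLING_WORDS = {
    "the", "a", "an", "of", "in", "on", "at", "to", "for", "with",
    "by", "from", "and", "or", "is", "are", "was", "were",
}


def needs_filler(heard_prefix: str) -> bool:
    """Return True if the heard prefix looks semantically incomplete."""
    words = heard_prefix.lower().split()
    if not words:
        # Nothing heard yet, so there is nothing to respond to.
        return True
    last_word = words[-1].strip(",.?!")
    return last_word in DANGLING_WORDS


if __name__ == "__main__":
    for prefix in ["who wrote the declaration of", "who wrote hamlet"]:
        print(repr(prefix), "-> needs filler:", needs_filler(prefix))
```

A system built along these lines might respond immediately when the check returns False and otherwise play a short filler while waiting for the rest of the utterance. A trained classifier would likely replace the word list in practice.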

Abstract

Response time is a problem for many current spoken dialogue systems driven by Large Language Models (LLMs). Some efforts, such as Groq, address this with lightning-fast LLM processing, but the cognitive psychology literature shows that in human-to-human dialogue, responses often begin before the speaker has finished their utterance. If we wish to maintain human dialogue latencies, no amount of delay for LLM processing is acceptable. In this paper, we discuss methods for understanding an utterance in close to real time and generating a response so that the system can match human-level conversational turn delays. Responding this early means that the information content of the final part of the speaker's utterance is lost to the LLM. Using the Google NaturalQuestions (NQ) database, our results show that GPT-4 can effectively fill in missing context from a dropped word at the end of a question over…
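The dropped-word setup described in the abstract can be sketched as follows. This is an illustrative assumption about how such test items might be built, not the authors' code: the function names, the prompt wording, and the example question are all hypothetical, and no LLM call is made.

```python
from typing import Tuple


def drop_final_words(question: str, n_dropped: int = 1) -> Tuple[str, str]:
    """Split a question into (heard_prefix, lost_suffix).

    Simulates a system that starts responding before the speaker finishes,
    so the last `n_dropped` words never reach the language model.
    """
    words = question.strip().rstrip("?").split()
    if n_dropped < 1 or n_dropped >= len(words):
        raise ValueError("n_dropped must be between 1 and len(words) - 1")
    heard_prefix = " ".join(words[:-n_dropped])
    lost_suffix = " ".join(words[-n_dropped:])
    return heard_prefix, lost_suffix


def build_completion_prompt(heard_prefix: str) -> str:
    """Build a (hypothetical) prompt asking a model to infer the missing ending."""
    return (
        "The following question was cut off before the speaker finished. "
        "Answer the question the speaker most likely intended.\n"
        f"Question so far: {heard_prefix}"
    )


if __name__ == "__main__":
    # Made-up example question, not taken from NaturalQuestions.
    heard, lost = drop_final_words("who wrote the declaration of independence?")
    print(heard)  # who wrote the declaration of
    print(lost)   # independence
    print(build_completion_prompt(heard))
```

Comparing a model's answer to the truncated prompt against its answer to the full question is one way such an evaluation could be scored. The abstract does not specify the scoring method.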

Code & Models

Models

🤗
dahara1/orpheus-3b-0.1-ft_gguf
model · 57 dl

Taxonomy

Topics: Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing