Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction

Hyunbae Jeon; Frederic Guintu; Rayvant Sahni

arXiv:2412.18061 · cs.SD · December 25, 2024

TL;DR

This paper introduces Lla-VAP, an ensemble approach combining Llama-based language models and voice activity projection to improve turn-taking prediction accuracy in conversations, evaluated on multiple datasets.

Contribution

It presents a novel multi-modal ensemble method integrating LLMs and VAP models for more accurate turn-taking prediction in diverse conversational settings.

Findings

01 Improved prediction accuracy over existing models
02 Effective on both scripted and unscripted conversations
03 Identifies strengths and limitations of current approaches

Abstract

Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the turn so that another speaker can begin. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.
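
As a rough illustration of the ensemble idea, the sketch below fuses two per-step turn-shift probability streams, one standing in for a text-based (LLM) model and one for an audio-based (VAP) model. It is a deliberately simplified weighted average, not the LSTM ensemble named in the title, and every name, weight, and threshold in it is a hypothetical choice made for this example rather than something taken from the paper.

```python
from typing import List


def fuse_turn_shift_scores(
    llm_probs: List[float],
    vap_probs: List[float],
    llm_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Weighted average of two aligned per-step turn-shift probability streams."""
    if len(llm_probs) != len(vap_probs):
        raise ValueError("both streams must cover the same time steps")
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must lie in [0, 1]")
    return [
        llm_weight * l + (1.0 - llm_weight) * v
        for l, v in zip(llm_probs, vap_probs)
    ]


def predict_turn_shifts(fused: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of steps whose fused score reaches the threshold."""
    return [i for i, p in enumerate(fused) if p >= threshold]


# Toy inputs: made-up probabilities for three aligned time steps.
llm_stream = [0.1, 0.8, 0.2]
vap_stream = [0.3, 0.6, 0.9]
fused_scores = fuse_turn_shift_scores(llm_stream, vap_stream)
candidate_steps = predict_turn_shifts(fused_scores)
```

A learned combiner such as an LSTM would replace the fixed weight with parameters trained on labeled conversations; the fixed average here only shows how linguistic and acoustic evidence can be merged step by step.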

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling