Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
Hyunbae Jeon, Frederic Guintu, Rayvant Sahni

TL;DR
This paper introduces Lla-VAP, an LSTM ensemble that combines a Llama-based language model with a voice activity projection (VAP) model for turn-taking prediction in conversations, evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets.
Contribution
It presents a novel multi-modal ensemble method integrating LLMs and VAP models for more accurate turn-taking prediction in diverse conversational settings.
Findings
Improved prediction accuracy over existing models
Effective on both scripted and unscripted conversations
Identifies strengths and limitations of current approaches
Abstract
Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the turn so that another speaker can begin. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.
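To make the idea of combining the two model outputs concrete, here is a minimal sketch of late fusion over per-step turn-shift probabilities. It is not the paper's method: the title describes an LSTM ensemble, whereas this sketch swaps in a fixed-weight average as a simpler stand-in. The function names, the `llm_weight` and `threshold` parameters, and the example probabilities are all hypothetical and chosen only for illustration.

```python
from typing import List


def fuse_turn_probabilities(
    llm_probs: List[float],
    vap_probs: List[float],
    llm_weight: float = 0.5,
) -> List[float]:
    """Weighted average of two per-step turn-shift probability streams.

    Assumption: both models emit one probability per aligned time step.
    A fixed weight stands in for the learned LSTM ensemble.
    """
    if len(llm_probs) != len(vap_probs):
        raise ValueError("probability streams must have the same length")
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must lie in [0, 1]")
    return [
        llm_weight * p_llm + (1.0 - llm_weight) * p_vap
        for p_llm, p_vap in zip(llm_probs, vap_probs)
    ]


def predict_trps(fused_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return indices of steps whose fused probability meets the threshold."""
    return [i for i, p in enumerate(fused_probs) if p >= threshold]


# Made-up probabilities for four aligned steps (illustrative only).
llm_probs = [0.1, 0.2, 0.9, 0.3]
vap_probs = [0.2, 0.1, 0.7, 0.6]

fused = fuse_turn_probabilities(llm_probs, vap_probs, llm_weight=0.5)
trp_steps = predict_trps(fused, threshold=0.5)
print(trp_steps)
```

In this toy input only the third step is flagged, because it is the only one where the linguistic and acoustic signals are both high. A learned ensemble could instead weight the two streams differently depending on context.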
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
