Multimodal Machine Learning Can Predict Videoconference Fluidity and Enjoyment

Andrew Chang; Viswadruth Akkaraju; Ray McFadden Cogliano; David Poeppel; Dustin Freeman

arXiv:2501.03190·cs.LG·March 11, 2025

Open Access

TL;DR

This paper demonstrates that multimodal machine learning models using audio, facial, and body motion data can predict moments of negative experience and conversational events in videoconferencing, reaching an ROC-AUC of up to 0.87 on hold-out sessions, which could help improve the user experience.

Contribution

The paper introduces a multimodal machine learning approach to predicting videoconference fluidity and enjoyment, and shows that audio-video features, domain-general audio features in particular, are effective for identifying negative user experiences.

Findings

01. Models achieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions.

02. Domain-general audio features were the most critical for prediction.

03. Multimodal signals can identify rare moments of negative experience.
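For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how ROC-AUC can be computed from scratch. It is not the authors' evaluation code; the labels and scores are made-up toy values, and the function name `roc_auc` is a hypothetical helper. ROC-AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which is why it stays informative when positive moments are rare.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC: fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (an assumption, not from the paper): 2 "negative experience"
# clips among 8, with hypothetical model scores.
labels = [0, 0, 0, 1, 0, 0, 1, 0]
scores = [0.1, 0.3, 0.2, 0.8, 0.4, 0.05, 0.35, 0.6]
auc = roc_auc(labels, scores)
print(f"ROC-AUC = {auc:.3f}")
```

The pairwise loop is quadratic and is meant only to make the definition explicit; a library routine such as scikit-learn's `roc_auc_score` would be the usual choice in practice.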

Abstract

Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus, extracting audio embeddings, facial actions, and body motion features to train models for identifying low conversational fluidity, low enjoyment, and classifying conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions, with domain-general audio features proving most critical. This work demonstrates that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. In addition, this is a contribution to research…
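The abstract reports results on hold-out videoconference sessions, meaning that clips from a held-out session never appear in training. A minimal sketch of such a session-level split is shown below. It is an illustration only: the `session` and `clip_id` fields, the session names, the 20% hold-out fraction, and the helper `split_by_session` are all assumptions, not details taken from the paper or from the RoomReader corpus.

```python
import random


def split_by_session(clips, holdout_fraction=0.2, seed=0):
    """Split clips so that every clip from a given session lands on the
    same side, preventing leakage between training and hold-out data."""
    sessions = sorted({c["session"] for c in clips})
    rng = random.Random(seed)
    rng.shuffle(sessions)
    # Hold out at least one whole session.
    n_holdout = max(1, round(len(sessions) * holdout_fraction))
    holdout = set(sessions[:n_holdout])
    train = [c for c in clips if c["session"] not in holdout]
    test = [c for c in clips if c["session"] in holdout]
    return train, test


# Hypothetical data: 20 clips spread evenly over 5 sessions.
clips = [{"session": f"S{i % 5}", "clip_id": i} for i in range(20)]
train, test = split_by_session(clips)
print(len(train), "training clips;", len(test), "hold-out clips")
```

Splitting by session rather than by clip matters because clips from the same conversation share speakers, acoustics, and topic, so a clip-level random split would tend to overstate performance. scikit-learn's `GroupShuffleSplit` offers a comparable grouped split.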

Taxonomy

Topics: Image and Video Quality Assessment