Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience

Andrew Chang; Chenkai Hu; Ji Qi; Zhuojian Wei; Kexin Zhang; Viswadruth Akkaraju; David Poeppel; Dustin Freeman

arXiv:2506.13971·eess.AS·August 20, 2025

TL;DR

This paper introduces a semi-supervised multimodal learning approach that detects negative moments (loss of fluidity or enjoyment) in videoconference conversations, reaching an ROC-AUC of 0.9 while substantially reducing the need for manual annotation.

Contribution

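The study presents a semi-supervised multimodal fusion framework that minimizes annotation requirements for modeling videoconference experience. It outperforms supervised models by up to 4% given the same amount of labeled data, and with just 8% labeled data it reaches 96% of the supervised model's full-data performance.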

Findings

01

SSL achieves ROC-AUC of 0.9 and F1 of 0.6

02

The best SSL model with 8% labeled data reaches 96% of full-data SL performance

03

Modality-fused co-training SSL outperforms SL by up to 4% with the same amount of labeled data

Abstract

Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in naturalistic data, and thus training a supervised learning (SL) model requires costly manual data annotation. We applied semi-supervised learning (SSL) to leverage targeted labeled and unlabeled clips for training multimodal (audio, facial, text) deep features to predict non-fluid or unenjoyable moments in holdout videoconference sessions. The modality-fused co-training SSL achieved an ROC-AUC of 0.9 and an F1 score of 0.6, outperforming SL models by up to 4% with the same amount of labeled data. Remarkably, the best SSL model with just 8% labeled data matched 96% of the SL model's full-data performance. This shows an annotation-efficient framework for…
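
To make the idea of co-training across modalities concrete, the following is a minimal sketch and not the authors' implementation. It assumes two synthetic feature views standing in for audio and facial features, a plain logistic regression in place of the paper's deep features, and a simple rule in which each view pseudo-labels the unlabeled clips it is most confident about and adds them to a shared labeled pool. The function names (`fit_logreg`, `co_train`, `fused_proba`), the number of rounds, and the per-round quota are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_logreg(X, y, steps=300, lr=0.5):
    """Gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(model, X):
    w, b = model
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def train_views(views, labels):
    """Fit one classifier per view on the currently labeled clips."""
    idx = np.array(sorted(labels))
    tgt = np.array([labels[i] for i in idx], dtype=float)
    return [fit_logreg(V[idx], tgt) for V in views]

def co_train(views, y, labeled_idx, rounds=5, per_round=10):
    """Assumed co-training loop: y is read only at labeled_idx."""
    labels = {int(i): int(y[i]) for i in labeled_idx}
    n = views[0].shape[0]
    for _ in range(rounds):
        models = train_views(views, labels)
        for model, V in zip(models, views):
            pool = np.array([i for i in range(n) if i not in labels])
            if pool.size == 0:
                break
            p = predict_proba(model, V[pool])
            # Most confident = probability farthest from 0.5.
            top = np.argsort(-np.abs(p - 0.5))[:per_round]
            for j in top:
                labels[int(pool[j])] = int(p[j] >= 0.5)
    return train_views(views, labels), labels

def fused_proba(models, views):
    """Late fusion (an assumption): average the per-view probabilities."""
    return np.mean([predict_proba(m, V) for m, V in zip(models, views)], axis=0)

# Synthetic stand-in data: 400 clips, roughly 30% "negative moments".
rng = np.random.default_rng(0)
n = 400
y_true = (rng.random(n) < 0.3).astype(int)
audio = rng.normal(size=(n, 4)) + 1.5 * y_true[:, None]
face = rng.normal(size=(n, 4)) + 1.5 * y_true[:, None]
labeled_idx = rng.choice(n, size=32, replace=False)  # 8% of clips labeled

models, labels = co_train([audio, face], y_true, labeled_idx)
probs = fused_proba(models, [audio, face])
acc = float(np.mean((probs >= 0.5) == y_true))
```

Here `labels` ends up holding the 32 annotated clips plus the pseudo-labeled ones, and `acc` is the accuracy of the fused prediction on the synthetic data. The 8% labeled fraction mirrors the figure quoted in the abstract, but the data and the resulting score say nothing about the paper's reported results.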
