Exploiting temporal information to detect conversational groups in videos and predict the next speaker

Lucrezia Tosato; Victor Fortier; Isabelle Bloch; Catherine Pelachaud

arXiv:2408.16380·cs.CV·August 30, 2024

TL;DR

This paper presents a method that uses temporal and multimodal signals in videos to detect conversational groups and, with an LSTM network, to predict the next speaker in a group conversation.

Contribution

It introduces an approach that uses people's engagement levels as a cue for group membership and an LSTM to predict the next speaker in videos.

Findings

01

85% true positives in group detection

02

98% accuracy in predicting the next speaker

03

Effective use of temporal and multimodal features

Abstract

Studies in human-human interaction have introduced the concept of F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives: detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits temporal information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach uses a recurrent neural network, the Long Short-Term Memory (LSTM), to predict who will take the next speaking turn in a conversational group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
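To make the next-speaker idea concrete, the following is a minimal sketch of an LSTM that reads a sequence of per-frame group features and outputs a probability for each participant taking the next turn. It is not the authors' implementation: the abstract does not specify the architecture or the input features, so the feature layout, the layer sizes, and the names `lstm_step`, `predict_next_speaker`, and `init_params` are all assumptions, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def init_params(D, H, n_participants, seed=0):
    """Random (untrained) weights, for illustration only."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, (4 * H, D))
    U = rng.normal(0.0, 0.1, (4 * H, H))
    b = np.zeros(4 * H)
    W_out = rng.normal(0.0, 0.1, (n_participants, H))
    b_out = np.zeros(n_participants)
    return W, U, b, W_out, b_out


def predict_next_speaker(frames, params):
    """frames: (T, D) array, one feature vector per video frame.

    The feature layout is hypothetical (e.g. a few multimodal cues per
    participant, concatenated). Returns a probability per participant.
    """
    W, U, b, W_out, b_out = params
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in frames:               # unroll the LSTM over time
        h, c = lstm_step(x, h, c, W, U, b)
    logits = W_out @ h + b_out
    logits = logits - logits.max() # numerically stable softmax
    exp = np.exp(logits)
    return exp / exp.sum()


# Toy usage: 3 participants, 2 assumed features each, 20 frames.
n_participants, feats_per_person, T = 3, 2, 20
D = n_participants * feats_per_person
params = init_params(D, H=8, n_participants=n_participants)
frames = np.random.default_rng(1).random((T, D))
probs = predict_next_speaker(frames, params)
print(probs, int(np.argmax(probs)))
```

With trained weights, the index of the largest probability would be read as the predicted next speaker; here the output only demonstrates the data flow from a temporal window to a per-participant score.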

Taxonomy

Topics: Public Relations and Crisis Communication · Speech and Audio Processing · Speech and dialogue systems