Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking
Rahul Sharma, Tanaya Guha, Gaurav Sharma

TL;DR
This paper introduces a multichannel attention network that analyzes visual cues from TED talks to predict their popularity, demonstrating that non-verbal visual features are highly informative and interpretable for public speaking success.
Contribution
It presents a novel attention-based LSTM model that leverages visual features to predict talk popularity and provides interpretability of visual cue importance over time.
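As a rough illustration of how attention over time can make a sequence model interpretable, the sketch below pools per-time-step feature vectors (standing in for LSTM outputs) into one summary vector using softmax weights. It is a minimal, single-channel sketch under stated assumptions, not the authors' architecture: the dot-product scoring vector `w`, the function names, and the toy inputs are all hypothetical.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, w):
    """Pool T feature vectors into one using attention weights.

    hidden_states: list of T vectors (e.g. LSTM outputs), each of length D.
    w: hypothetical scoring vector of length D (learned in a real model).
    Returns (context, weights); weights sum to 1 over the T time steps.
    """
    # One scalar relevance score per time step (dot product with w).
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the time steps gives the pooled representation.
    context = [sum(a * h[d] for a, h in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights

# Toy input: three time steps, two features each (made-up values).
H = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
w = [1.0, 0.0]
context, weights = attention_pool(H, w)
```

Inspecting `weights` shows which time steps contributed most to the pooled vector. This kind of per-step weight is what lets an attention model report the importance of cues over time.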
Findings
Visual cues alone predict popularity with high accuracy.
The model learns human-like attention mechanisms for interpretability.
Visual features significantly contribute to public speaking success.
Abstract
Public speaking is an important aspect of human communication and interaction. The majority of computational work on public speaking concentrates on analyzing the spoken content, and the verbal behavior of the speakers. While the success of public speaking largely depends on the content of the talk, and the verbal behavior, non-verbal (visual) cues, such as gestures and physical appearance also play a significant role. This paper investigates the importance of visual cues by estimating their contribution towards predicting the popularity of a public lecture. For this purpose, we constructed a large database of more than TED talk videos. As a measure of popularity of the TED talks, we leverage the corresponding (online) viewers' ratings from YouTube. Visual cues related to facial and physical appearance, facial expressions, and pose variations are extracted from the video frames…
