Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction
Onno Kampman, Elham J. Barezi, Dario Bertero, Pascale Fung

TL;DR
This paper introduces a tri-modal neural network architecture that combines audio, visual, and text data to improve automatic personality prediction from videos, outperforming single-modality models.
Contribution
The paper presents a multimodal architecture that fuses audio, text, and video channels both at the decision level and by concatenating their fully connected layers, and shows that the fused model outperforms each individual modality.
Findings
Multimodal fusion improves on the best single modality (video) by 9.4%.
Training the fused network with full backpropagation outperforms a linear combination of the modality outputs.
Each modality's relevance varies across different personality traits.
Abstract
We propose a tri-modal architecture to predict Big Five personality trait scores from video clips with different channels for audio, text, and video data. For each channel, stacked Convolutional Neural Networks are employed. The channels are fused both at the decision level and by concatenating their respective fully connected layers. It is shown that a multimodal fusion approach outperforms each single modality channel, with an improvement of 9.4% over the best individual modality (video). Full backpropagation is also shown to be better than a linear combination of modalities, meaning complex interactions between modalities can be leveraged to build better models. Furthermore, we can see the prediction relevance of each modality for each trait. The described model can be used to increase the emotional intelligence of virtual agents.
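The two fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the stacked CNN channels are replaced by random arrays, and the feature sizes, the sigmoid output, the [0, 1] score range, and the function names `decision_level_fusion` and `concat_fusion` are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TRAITS = 5   # Big Five traits
BATCH = 4      # hypothetical number of video clips

def decision_level_fusion(preds, weights):
    """Linear combination of per-modality trait predictions.

    preds:   dict modality -> array (batch, N_TRAITS)
    weights: dict modality -> array (N_TRAITS,), one weight per trait
    """
    return sum(weights[m] * preds[m] for m in preds)

def concat_fusion(features, W, b):
    """Concatenate per-modality fully connected features, then apply one joint layer.

    In a trained network, gradients would flow from this joint layer back into
    every channel (full backpropagation); here only the forward pass is shown.
    """
    joint = np.concatenate([features[m] for m in sorted(features)], axis=1)
    return 1.0 / (1.0 + np.exp(-(joint @ W + b)))  # sigmoid output (assumed)

modalities = ["audio", "text", "video"]

# Stand-ins for each channel's own trait predictions, assumed to lie in [0, 1).
preds = {m: rng.random((BATCH, N_TRAITS)) for m in modalities}

# Per-trait weights, normalized so the three modalities sum to 1 for every trait.
raw = rng.random((len(modalities), N_TRAITS))
norm = raw / raw.sum(axis=0)
weights = {m: norm[i] for i, m in enumerate(modalities)}
fused = decision_level_fusion(preds, weights)

# Stand-ins for each channel's last fully connected layer (hypothetical sizes).
dims = {"audio": 64, "text": 64, "video": 128}
features = {m: rng.standard_normal((BATCH, d)) for m, d in dims.items()}
W = rng.standard_normal((sum(dims.values()), N_TRAITS)) * 0.01
b = np.zeros(N_TRAITS)
joint_pred = concat_fusion(features, W, b)
```

In the decision-level variant, the per-trait weights give a direct reading of how much each modality contributes to each trait. The concatenation variant lets the joint layer model interactions between modalities that a linear combination cannot express.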
Taxonomy
Topics: Computational and Text Analysis Methods
