A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product
Ao Xiang, Zongqing Qi, Han Wang, Qin Yang, Danqing Ma

TL;DR
This paper presents a multimodal Transformer-based model utilizing tensor product fusion to classify student emotions with high accuracy, outperforming existing methods and highlighting the potential for integrating diverse data modalities.
Contribution
Introduces a novel multimodal fusion network combining Transformer architecture and tensor product strategy for improved emotion recognition.
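As a rough illustration of what a tensor product fusion step can look like, the sketch below takes the outer product of a text vector and an image vector and feeds the flattened result to a softmax classifier. It is a minimal sketch, not the authors' implementation: the vector sizes, the number of classes, the random weights, and the names `tensor_product_fusion`, `softmax`, and `NUM_CLASSES` are all assumptions made for illustration, and the random vectors merely stand in for BERT and ViT embeddings.

```python
import numpy as np


def tensor_product_fusion(text_vec, image_vec):
    """Fuse two 1-D feature vectors via their outer (tensor) product, flattened to 1-D."""
    text_vec = np.asarray(text_vec, dtype=float)
    image_vec = np.asarray(image_vec, dtype=float)
    # Entry (i, j) of the outer product is text_vec[i] * image_vec[j], so every
    # text feature interacts multiplicatively with every image feature.
    return np.outer(text_vec, image_vec).reshape(-1)


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(0)
text_vec = rng.standard_normal(8)   # stand-in for a BERT text embedding (assumed size)
image_vec = rng.standard_normal(6)  # stand-in for a ViT image embedding (assumed size)

fused = tensor_product_fusion(text_vec, image_vec)
print(fused.shape)  # (48,)

NUM_CLASSES = 3  # hypothetical number of emotion classes
# Untrained random weights, used only to show the shape of the classifier head.
W = rng.standard_normal((NUM_CLASSES, fused.size)) * 0.01
probs = softmax(W @ fused)
```

The fused vector has length 8 × 6 = 48 here; with real encoder outputs the product grows quickly, so practical systems usually project each modality to a smaller dimension first.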
Findings
Achieved 93.65% classification accuracy.
Outperformed existing methods such as CLIP and ViLBERT in accuracy.
Demonstrated faster inference than those baselines.
Abstract
This paper introduces a new multi-modal model based on the Transformer architecture and tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors to classify students' psychological conditions, with an accuracy of 93.65%. The purpose of the study is to accurately analyze the mental health status of students from various data sources. This paper discusses modal fusion methods, including early, late and intermediate fusion, to overcome the challenges of integrating multi-modal information. Ablation studies compare the performance of different models and fusion techniques, showing that the proposed model outperforms existing methods such as CLIP and ViLBERT in terms of accuracy and inference speed. Conclusions indicate that while this model has significant advantages in emotion recognition, its potential to incorporate other data modalities provides areas for…
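The difference between the early and late fusion strategies mentioned in the abstract can be sketched as follows. This uses generic textbook definitions rather than the paper's exact configuration: early fusion concatenates the modality features before a single classifier, while late fusion classifies each modality separately and averages the predictions. Intermediate fusion, which merges hidden representations partway through the network, is not shown. All sizes, weights, and function names are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def early_fusion(text_vec, image_vec, w_joint):
    """Concatenate the raw features, then apply one joint classifier."""
    joint = np.concatenate([text_vec, image_vec])
    return softmax(w_joint @ joint)


def late_fusion(text_vec, image_vec, w_text, w_image):
    """Classify each modality on its own, then average the class probabilities."""
    p_text = softmax(w_text @ text_vec)
    p_image = softmax(w_image @ image_vec)
    return 0.5 * (p_text + p_image)


rng = np.random.default_rng(0)
num_classes = 3                      # hypothetical number of emotion classes
text_vec = rng.standard_normal(8)    # stand-in for a text embedding (assumed size)
image_vec = rng.standard_normal(6)   # stand-in for an image embedding (assumed size)

# Untrained random weights, used only to make the sketch runnable.
w_joint = rng.standard_normal((num_classes, 8 + 6))
w_text = rng.standard_normal((num_classes, 8))
w_image = rng.standard_normal((num_classes, 6))

p_early = early_fusion(text_vec, image_vec, w_joint)
p_late = late_fusion(text_vec, image_vec, w_text, w_image)
```

Both functions return a probability vector over the classes; they differ only in where the two modalities are combined.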
Taxonomy
Topics: Educational Technology and Pedagogy · Hand Gesture Recognition Systems · Advanced Computing and Algorithms
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam
