Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting

Han Zhang; Yu Lu; Liyun Zhang; Dian Ding; Dinghua Zhao; Yi-Chao Chen; Ye Wu; and Guangtao Xue

arXiv:2411.17674 · cs.CL · October 3, 2025

Open Access

TL;DR

This paper introduces Lantern, a framework that enhances multi-modal emotion recognition by prompting large language models with receptive-field-aware attention weighting, effectively integrating multimedia features and external knowledge.

Contribution

Lantern is a novel framework that combines vanilla models with LLMs using receptive-field-aware attention to improve emotion recognition accuracy.

Findings

1. Lantern improves vanilla model performance by up to 1.80%.

2. The framework effectively integrates multimedia features and external knowledge.

3. Experiments on IEMOCAP show significant accuracy gains.

Abstract

Understanding the emotions in a dialogue usually requires external knowledge to interpret its content accurately. As LLMs become more and more powerful, we do not want to settle for the limited ability of a pre-trained language model. However, LLMs either can process only the text modality or are too expensive to apply to multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a given vanilla model by prompting large language models with receptive-field-aware attention weighting. The framework trains a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with its…
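To make the "predictions as references" step concrete, here is a minimal sketch of how a vanilla model's per-utterance outputs might be formatted into a text prompt for an LLM. It is an illustration only, not the authors' prompt: the function name `build_reference_prompt`, the instruction wording, the emotion labels, and the dimension names (`valence`, `arousal`) are all assumptions, and the receptive-field slicing and attention weighting are not shown because the abstract above is truncated before describing them.

```python
from typing import Dict, List


def build_reference_prompt(
    utterances: List[str],
    class_probs: List[Dict[str, float]],
    dim_scores: List[Dict[str, float]],
) -> str:
    """Format a dialogue slice with vanilla-model predictions as references.

    Hypothetical helper: the layout and wording are assumptions, not the
    prompt used in the paper.
    """
    if not (len(utterances) == len(class_probs) == len(dim_scores)):
        raise ValueError("inputs must have one entry per utterance")

    # Instruction header (assumed wording).
    lines = [
        "For each utterance, adjust the reference emotion probabilities "
        "using the dialogue context. Return one probability per class.",
        "",
    ]

    # One block per utterance: text, class probabilities, dimension scores.
    for i, (text, probs, dims) in enumerate(
        zip(utterances, class_probs, dim_scores), start=1
    ):
        prob_str = ", ".join(f"{label}={p:.2f}" for label, p in probs.items())
        dim_str = ", ".join(f"{name}={v:.2f}" for name, v in dims.items())
        lines.append(f"[{i}] {text}")
        lines.append(f"    reference probabilities: {prob_str}")
        lines.append(f"    reference dimensions: {dim_str}")

    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up values for illustration; not taken from any dataset.
    print(
        build_reference_prompt(
            ["I got the job!", "That's great news."],
            [{"happy": 0.7, "sad": 0.3}, {"happy": 0.6, "sad": 0.4}],
            [{"valence": 4.2, "arousal": 3.9}, {"valence": 3.8, "arousal": 3.1}],
        )
    )
```

Passing the probabilities as plain text keeps the LLM text-only, while the multimedia evidence reaches it indirectly through the vanilla model's predictions, which matches the motivation stated in the abstract.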

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax