FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

Weidong Chen; Cheng Ye; Zhendong Mao; Peipei Song; Xinyan Liu; Lei Zhang; Xiaojun Chang; Yongdong Zhang

arXiv:2603.17455·cs.CV·March 19, 2026

Open Access

TL;DR

FACE-net introduces a retrieval-enhanced framework for emotional video captioning that effectively mines factual and emotional cues, providing adaptive guidance to generate more accurate and bias-aware descriptions.

Contribution

The paper proposes a novel retrieval-based architecture with factual calibration and emotion augmentation modules to improve factual and emotional content integration in video captioning.

Findings

01 Enhanced factual semantics through retrieval and calibration.

02 Adaptive emotion augmentation improves emotional relevance.

03 Reduces factual-emotional bias in generated captions.

Abstract

Emotional Video Captioning (EVC) is an emerging task that aims to describe the factual content of a video together with the intrinsic emotions it expresses. Existing works perceive global emotional cues and then combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation make it difficult for these methods to handle the factual-emotional bias, that is, the fact that different samples place different factual and emotional requirements on generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which uses a unified architecture to collaboratively mine factual-emotional semantics and provide adaptive, accurate guidance for generation, overcoming the tendency to compromise between factual and emotional descriptions when learning over all samples. Technically, we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis