Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks

Yufei Wang; Haixu Liu; Tianxiang Xu; Chuancheng Shi; Hongsheng Xing

arXiv:2602.08057·cs.CV·February 10, 2026

TL;DR

This paper introduces a multimodal weakly supervised framework for hidden emotion recognition in videos. It uses VLM-generated pseudo-labels and Transformer encoders, and reports state-of-the-art accuracy (over 0.69) on the iMiGUE dataset.

Contribution

The paper proposes a weak-supervision strategy in which Gemini 2.5 Pro generates pseudo-labels and reasoning texts for downstream models. It combines multiple modalities and replaces the usual graph neural network backbone for key-point modeling with a simpler MLP.

Findings

01 Achieved state-of-the-art accuracy of over 0.69 on the iMiGUE dataset.

02 Validated that MLP-based key-point models can outperform GCNs.

03 Demonstrated that the approach remains effective despite severe class imbalance.

Abstract

To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and…
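
The key-point step described above (per-frame key-point vectors augmented with inter-frame offsets, then modeled by an MLP rather than a graph network) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hidden size, the mean pooling over time, the two-class output, and the random weights are all assumptions chosen for illustration, and the input is random data standing in for OpenPose output. Only the 137-dimensional per-frame vector and the idea of offset features come from the abstract.

```python
import numpy as np


def add_offset_features(keypoints: np.ndarray) -> np.ndarray:
    """Append inter-frame offsets to a (T, D) key-point sequence.

    The offset for frame t is keypoints[t] - keypoints[t - 1]; the first
    frame has no predecessor, so its offset is set to zero (an assumption).
    Returns an array of shape (T, 2 * D).
    """
    offsets = np.zeros_like(keypoints)
    offsets[1:] = keypoints[1:] - keypoints[:-1]
    return np.concatenate([keypoints, offsets], axis=1)


def mlp_forward(x, w1, b1, w2, b2):
    """Hypothetical two-layer MLP: per-frame ReLU layer, mean over time, linear head."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # (T, H), applied frame by frame
    pooled = hidden.mean(axis=0)           # (H,), simple temporal pooling (assumed)
    return pooled @ w2 + b2                # (C,), class logits


rng = np.random.default_rng(0)
T, D, H, C = 16, 137, 32, 2  # frames, key-point dims, hidden units, classes (H, C assumed)

seq = rng.normal(size=(T, D))      # stand-in for one OpenPose key-point sequence
feats = add_offset_features(seq)   # (T, 2 * D)

# Randomly initialised weights; a real model would learn these from pseudo-labels.
w1 = rng.normal(scale=0.1, size=(2 * D, H))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.1, size=(H, C))
b2 = np.zeros(C)

logits = mlp_forward(feats, w1, b1, w2, b2)
print(feats.shape, logits.shape)
```

The offset features give the MLP explicit motion information that a per-frame model would otherwise not see, which is one plausible reason a plain MLP can stand in for a graph-based backbone here.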

Taxonomy

Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Multimodal Machine Learning Applications