Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness
Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei

TL;DR
This paper introduces a new dataset, model, and evaluation metrics to enhance video multimodal large language models' ability to understand and describe facial expressions in videos, addressing current limitations in datasets and visual token capacity.
Contribution
The paper presents a novel instruction-following dataset for facial expression captioning, a face encoding model called FaceTrack-MM, and a new benchmark with evaluation metrics for improved facial expression perception in videos.
Findings
FaceTrack-MM outperforms existing models at tracking faces and focusing on the main character's expressions.
The dataset of 5,033 manually annotated video clips (over 700,000 tokens) is designed to help video MLLMs recognize subtle facial nuances.
The new evaluation metric effectively assesses content and temporal sequence accuracy.
Abstract
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 high-quality, manually annotated video clips containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior…
Taxonomy
Topics: Face recognition and analysis · Emotion and Mood Recognition · Face Recognition and Perception
