Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

Jiaxing Zhao; Boyuan Sun; Xiang Chen; Xihan Wei

arXiv:2501.07978·cs.CV·January 15, 2025

TL;DR

This paper introduces a new dataset, model, and evaluation metrics to enhance video multimodal large language models' ability to understand and describe facial expressions in videos, addressing current limitations in datasets and visual token capacity.

Contribution

The paper presents a novel instruction-following dataset for facial expression captioning, a face encoding model called FaceTrack-MM, and a new benchmark with evaluation metrics for improved facial expression perception in videos.

Findings

01. FaceTrack-MM outperforms existing models in face tracking and expression focus.

02. The dataset enables better training for subtle facial nuance recognition.

03. The new evaluation metric effectively assesses content and temporal sequence accuracy.

Abstract

Facial expression captioning has found widespread application across various domains. Recently, video Multimodal Large Language Models (MLLMs) have shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 high-quality, manually annotated video clips containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior…

Code & Models

Repositories

jiaxing-star/facialdynamic (Official)

Videos

Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness · Underline

Taxonomy

Topics: Face recognition and analysis · Emotion and Mood Recognition · Face Recognition and Perception