Listen to Your Face: Inferring Facial Action Units from Audio Channel
Zibo Meng, Shizhong Han, Yan Tong

TL;DR
This paper introduces a novel audio-based approach for recognizing speech-related facial action units using a continuous time Bayesian network, outperforming visual methods, especially in challenging conditions with occlusions or head movements.
Contribution
It presents a new audio-only AU recognition framework based on CTBN modeling of the relationship between AUs and phonemes, validated on a novel audiovisual AU-coded database.
Findings
Achieves promising recognition accuracy for 7 speech-related AUs.
Outperforms state-of-the-art visual methods, especially in challenging scenarios.
Demonstrates robustness to occlusions and head movements.
Abstract
Extensive efforts have been devoted to recognizing facial action units (AUs). However, it is still challenging to recognize AUs from spontaneous facial displays, especially when they are accompanied by speech. Different from all prior work that utilized visual observations for facial AU recognition, this paper presents a novel approach that recognizes speech-related AUs exclusively from audio signals based on the fact that facial activities are highly correlated with voice during speech. Specifically, dynamic and physiological relationships between AUs and phonemes are modeled through a continuous time Bayesian network (CTBN); then AU recognition is performed by probabilistic inference via the CTBN model. A pilot audiovisual AU-coded database has been constructed to evaluate the proposed audio-based AU recognition framework. The database consists of a "clean" subset with frontal and…
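To make the CTBN idea in the abstract concrete, here is a minimal sketch of how a single binary AU node could be driven by an observed phoneme sequence. It is an illustration under stated assumptions, not the paper's model: the names `RATES`, `propagate`, and `au_trajectory` are hypothetical, the transition rates are invented for the example, and the real network presumably covers several AUs with learned parameters. Each phoneme selects a pair of rates (off-to-on, on-to-off) for the AU, and the probability that the AU is active is advanced in closed form across each phoneme segment.

```python
import math

# Hypothetical transition rates (per second) for one binary AU node,
# conditioned on the current phoneme: (off_to_on, on_to_off).
# These numbers are illustrative assumptions, not values from the paper.
RATES = {
    "sil": (0.2, 5.0),   # silence: AU mostly inactive
    "aa": (15.0, 1.0),   # open vowel: AU switches on quickly
    "b": (0.5, 20.0),    # bilabial closure: AU switches off quickly
}


def propagate(p_on, phoneme, duration):
    """Advance P(AU active) across one phoneme segment of `duration` seconds.

    For a two-state continuous-time Markov process with rates a (off->on)
    and b (on->off), dp/dt = a - (a + b) * p, whose solution relaxes
    exponentially toward the stationary value a / (a + b).
    """
    a, b = RATES[phoneme]
    stationary = a / (a + b)
    decay = math.exp(-(a + b) * duration)
    return stationary + (p_on - stationary) * decay


def au_trajectory(segments, p_on=0.0):
    """Return P(AU active) at the end of each (phoneme, duration) segment."""
    out = []
    for phoneme, duration in segments:
        p_on = propagate(p_on, phoneme, duration)
        out.append((phoneme, p_on))
    return out


if __name__ == "__main__":
    # Hypothetical phoneme alignment: silence, then "aa", then "b".
    for phoneme, p in au_trajectory([("sil", 0.1), ("aa", 0.2), ("b", 0.1)]):
        print(f"{phoneme}: P(AU active) = {p:.3f}")
```

The sketch treats the phoneme track as fully observed and piecewise constant, which reduces inference for one node to repeated closed-form updates. With uncertain phoneme labels or several interacting AUs, a general CTBN inference routine would be needed instead.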
Taxonomy
Topics: Emotion and Mood Recognition · EEG and Brain-Computer Interfaces · Speech and Audio Processing
