Towards Unified Facial Action Unit Recognition Framework by Large Language Models
Guohong Hu, Xing Lan, Hanyu Jiang, Jiayi Lyu, Jian Xue

TL;DR
This paper introduces AU-LLaVA, a unified facial action unit (AU) recognition framework built on a large language model. It reports the most accurate results for nearly half of the AUs on BP4D and DISFA, and improvements on all 24 AUs on FEAFA.
Contribution
The paper presents the first unified AU recognition framework based on LLMs, combining visual encoding and language understanding for improved AU detection.
Findings
Achieves the most accurate recognition results for nearly half of the AUs on the BP4D and DISFA datasets.
Improves F1-score by up to 11.4% on specific AUs compared with previous benchmark results.
Outperforms previous methods on all 24 AUs in the FEAFA dataset.
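For reference, the F1-score cited in the findings above is the harmonic mean of precision and recall. For AU occurrence detection it is typically computed separately for each AU over binary labels. A minimal sketch in plain Python follows; the function name `f1_per_au` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def f1_per_au(y_true, y_pred):
    """Return one F1-score per AU column.

    y_true, y_pred: lists of rows; each row is a list of 0/1 labels,
    one entry per AU (hypothetical layout chosen for this sketch).
    """
    num_aus = len(y_true[0])
    scores = []
    for j in range(num_aus):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 here when the AU never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy data: 4 frames, 2 AUs (made-up labels, for illustration only).
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
scores = f1_per_au(y_true, y_pred)
print(scores)  # [0.5, 1.0]
```

Per-AU scores like these are what make statements such as "best on nearly half of the AUs" possible, since each AU is compared against prior methods on its own.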
Abstract
Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on a Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves F1-score improvements of up to 11.4% in specific AU recognition compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements over all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and…
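The three-part design named in the abstract (visual encoder, linear projector, pre-trained LLM) can be pictured as a simple data flow: the encoder turns the image into patch features, the projector maps them into the LLM's embedding space, and the LLM reads them together with the text prompt. The sketch below shows only the projector step and the sequence assembly in plain Python. The dimensions, the function names (`make_linear`, `project`, `build_llm_input`), the stand-in features, and the choice to prepend visual tokens to the prompt are all assumptions made for illustration; the abstract does not specify them.

```python
import random

VISION_DIM = 4  # assumed toy size of each visual patch feature
LLM_DIM = 6     # assumed toy size of an LLM token embedding


def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialised weight matrix and zero bias (toy values)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(patch_features, weight, bias):
    """Apply y = W x + b to every patch feature vector."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in patch_features
    ]


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the prompt embeddings (assumed ordering)."""
    return visual_tokens + text_embeddings


# Stand-ins for the real components: 3 image patches and a 5-token prompt.
patch_features = [[0.1 * i] * VISION_DIM for i in range(3)]  # fake encoder output
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]        # fake prompt embeddings

weight, bias = make_linear(VISION_DIM, LLM_DIM)
visual_tokens = project(patch_features, weight, bias)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))  # 8 6
```

In a design of this kind, the text prompt is what selects the output format, which is consistent with the abstract's statement that the same image can yield different formats of AU results.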
Taxonomy
Topics: Face recognition and analysis
