Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu

TL;DR
This paper introduces Face-Human-Bench, a comprehensive benchmark for evaluating multi-modal assistants' understanding of faces and humans, including a hierarchical ability taxonomy, dataset, and evaluation of 25 models.
Contribution
It presents a new hierarchical ability taxonomy, a semi-automatic data pipeline, and a benchmark dataset for face and human understanding in multi-modal models.
Findings
Performance varies with ability types and target positions.
Chain of Thought prompting improves model performance.
Some abilities require specialist models for better understanding.
Abstract
Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench includes a development set and a test set, each with 1800 problems, supporting both English and Chinese. We conduct evaluations over…
Peer Reviews
Decision: Submitted to ICLR 2025
- **Comprehensive Benchmarking Framework**: Face-Human-Bench spans many abilities, providing a holistic evaluation of multimodal assistants’ capabilities in face and human understanding.
- **New Metrics and Evaluation Protocols**: The paper introduces RPSS to measure sensitivity to the relative position of targets and percentile recall to assess retrieval in large galleries. These metrics provide nuanced insights, aiding model development.
- **Multi-Language Support**: The benchmark ensu
- **Limited Discussion on Dataset Biases**: Although the benchmark includes diverse tasks, the paper could expand on potential biases in the benchmark datasets, especially considering the variability in demographic representations in face and human recognition tasks.
- **Generalizability to Other Tasks**: The applicability of Face-Human-Bench to tasks beyond face and human recognition remains unclear. Expanding on how these benchmarks might generalize to other domains would add depth.
- **Impact
The evaluation task pursued in this paper has some value, especially for researchers working on face and human analysis. The problem is that most of the tasks evaluated are purely vision tasks for which many strong baselines exist. It's not surprising that specialist models outperform the VLLMs on these tasks. But arguably the authors have put a considerable amount of effort into organising the benchmark and evaluating the models. Finally, the experiment of section 3.4 is interesting.
Overall, unfortunately, there are a few issues with the paper which limit the impact of the proposed work:
- It's not clear whether the proposed benchmark adds something to our understanding of VLLMs.
- It's not clear why one would use a VLLM to accomplish the tasks mentioned in the paper, which are visual perception tasks with very specific use cases. Since the proposed tasks are very different from the ones that the VLLMs were trained on it is perhaps not even meaningful to evaluate the models
1. The proposed Face-Human-Bench is a comprehensive evaluation benchmark that fully encompasses relevant tasks, making the assessment results more valuable and reliable.
2. The paper evaluates 25 existing mainstream MLLMs on Face-Human-Bench, with a substantial amount of experiments and rich content, intuitively demonstrating each MLLM's capabilities in facial and human understanding.
3. The paper is well-organized and clearly articulated, which improves readability and makes the findings access
1. The paper mentions that Face-Human-Bench consists of a development set with 900 problems and a test set with 1800 problems, but it lacks a description of the roles of these two sets. In the subsequent experimental results, which set were the results based on?
2. There is a point of confusion regarding the calculation of the overall score: how are the weights for each sub-task determined?
3. The paper states that Face-Human-Bench supports evaluations in both English and Chinese. What insights
Taxonomy
Topics: Face recognition and analysis · Social Robot Interaction and HRI · Speech and Audio Processing
Methods: Sparse Evolutionary Training
