Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures
Simiao Ren, Xingyu Shen, Ankit Raj, Albert Dai, Caroline (Manlin) Zhang, Yuan Xu, Zexi Chen, Siqi Wu, Chen Gong, Yuxin Zhang

TL;DR
This study benchmarks 34 models for facial age estimation (22 specialized architectures and 12 general-purpose vision-language models) across eight standard datasets. It finds that zero-shot VLMs outperform most specialized architectures, especially in age verification tasks.
Contribution
It provides the first large-scale cross-paradigm comparison, showing that general-purpose VLMs outperform most specialized age estimation models in a zero-shot setting.
Findings
VLMs achieve an average MAE of 5.65 years, compared with 9.88 years for non-LLM models.
The best VLM (Gemini 3 Flash Preview) reaches an MAE of 4.32 years, beating the best non-LLM model (MiVOLO, MAE 5.10 years) by 15%.
VLMs significantly reduce false adult rates in age verification tasks.
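The two metrics behind these findings can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, and the definition of the false adult rate used here (the share of true minors predicted to be at or above an adult threshold of 18) is an assumption, since the summary does not define it.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float],
                        predicted_ages: Sequence[float]) -> float:
    """Average absolute gap, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages) or len(true_ages) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def false_adult_rate(true_ages: Sequence[float],
                     predicted_ages: Sequence[float],
                     adult_age: float = 18) -> float:
    """Share of true minors whose predicted age is >= adult_age.

    Assumed definition; the threshold of 18 is also an assumption.
    """
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be of equal length")
    # Keep only the samples whose true age is below the adult threshold.
    minors = [(t, p) for t, p in zip(true_ages, predicted_ages) if t < adult_age]
    if not minors:
        raise ValueError("no minors in the sample")
    misclassified = sum(1 for _, p in minors if p >= adult_age)
    return misclassified / len(minors)
```

Under this definition, a lower false adult rate means fewer minors are mistakenly passed as adults, which is the error that matters most in age verification.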
Abstract
Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models - 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs - across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) surpasses the strongest non-LLM model (MiVOLO, MAE 5.10) by 15%.…
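The 15% figure is consistent with a relative reduction in MAE, taking the strongest non-LLM model as the baseline. The short check below assumes that reading; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline_mae: float, improved_mae: float) -> float:
    """Fraction by which improved_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - improved_mae) / baseline_mae


# MAE values quoted in the abstract (years).
mivolo_mae = 5.10   # strongest non-LLM model
gemini_mae = 4.32   # Gemini 3 Flash Preview

gap = relative_reduction(mivolo_mae, gemini_mae)
print(f"Relative MAE reduction: {gap:.1%}")
```

The result rounds to the 15% quoted in the abstract.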
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Facial Rejuvenation and Surgery Techniques
