Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures

Simiao Ren; Xingyu Shen; Ankit Raj; Albert Dai; Caroline (Manlin) Zhang; Yuan Xu; Zexi Chen; Siqi Wu; Chen Gong; Yuxin Zhang

arXiv:2602.07815·cs.CV·February 12, 2026

TL;DR

This study benchmarks 34 models (22 specialized age estimation architectures and 12 general-purpose VLMs) for facial age estimation across eight datasets, and finds that zero-shot VLMs outperform most specialized architectures, especially in age verification tasks.

Contribution

The paper provides the first large-scale cross-paradigm comparison, showing that general-purpose VLMs used zero-shot outperform most specialized age estimation models.

Findings

01

Zero-shot VLMs achieve an average MAE of 5.65 years, compared with 9.88 years for non-LLM models.

02

The best VLM (Gemini 3 Flash Preview) has an MAE of 4.32 years, about 15% lower than the 5.10 years of the best non-LLM model (MiVOLO).

03

VLMs significantly reduce false adult rates in age verification tasks.
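
The false adult rate in the third finding can be illustrated with a minimal sketch. It assumes the metric is the share of true minors whose estimated age reaches an adult threshold; the threshold of 18, the function name `false_adult_rate`, and the sample values are assumptions for illustration, not taken from the paper.

```python
def false_adult_rate(true_ages, predicted_ages, threshold=18.0):
    """Share of subjects truly under `threshold` whose predicted age is >= threshold.

    Assumed definition for illustration; the paper may define the metric differently.
    """
    # Keep only subjects who are actually minors.
    minors = [(t, p) for t, p in zip(true_ages, predicted_ages) if t < threshold]
    if not minors:
        raise ValueError("no subjects under the threshold")
    # A false adult is a minor whose estimated age reaches the threshold.
    false_adults = sum(1 for _, p in minors if p >= threshold)
    return false_adults / len(minors)


# Made-up values, not data from the benchmark.
true_ages = [12, 16, 17, 25, 40]
predicted_ages = [14.0, 19.5, 17.2, 23.0, 35.0]
rate = false_adult_rate(true_ages, predicted_ages)
```

Here one of the three minors (true age 16, estimated 19.5) is classified as an adult, so the rate is one third. A lower rate means fewer minors pass an age gate by mistake.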

Abstract

Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models - 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs - across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) surpasses the strongest non-LLM model (MiVOLO, MAE 5.10) by 15%.…
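
The two quantities the abstract reports, mean absolute error and the relative gap between the best models, can be sketched as follows. The function names and the toy ages are hypothetical; only the MAE values 4.32 and 5.10 come from the abstract.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def relative_improvement(baseline_mae, new_mae):
    """Fraction by which new_mae is lower than baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae


# Toy ages for illustration only: errors of 2, 3, and 5 years.
toy_mae = mean_absolute_error([20, 30, 40], [22, 27, 45])

# MAE values reported in the abstract: MiVOLO 5.10, Gemini 3 Flash Preview 4.32.
gain = relative_improvement(5.10, 4.32)
```

With the reported values, `gain` is (5.10 - 4.32) / 5.10, roughly 0.153, which matches the 15% figure in the abstract.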

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Facial Rejuvenation and Surgery Techniques