Benchmarking Foundation Models for Zero-Shot Biometric Tasks
Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross

TL;DR
This paper benchmarks the zero-shot and few-shot performance of 41 vision-language and multi-modal large language models across six biometric tasks, revealing their potential and limitations in biometric recognition without fine-tuning.
Contribution
It introduces a comprehensive benchmark for evaluating foundation models on biometric tasks, highlighting their capabilities and challenges in zero-shot settings.
Findings
Face verification reaches a True Match Rate (TMR) of 96.77% at a False Match Rate (FMR) of 1% without fine-tuning.
Iris recognition reaches a TMR of 97.55% at an FMR of 1% without fine-tuning.
Simple classifiers trained on the embeddings can detect deepfakes and presentation attacks, and can predict soft biometric attributes.
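To make the "TMR at 1% FMR" metric in these findings concrete, here is a minimal sketch of how it can be computed from comparison scores. It is an illustration, not the paper's evaluation code: the function names `cosine_scores` and `tmr_at_fmr`, the use of cosine similarity, and the synthetic scores are all assumptions made for the example.

```python
import numpy as np

def cosine_scores(a, b):
    """Row-wise cosine similarity between two (n, d) embedding arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def tmr_at_fmr(genuine, impostor, target_fmr=0.01):
    """Return (TMR, threshold) at the strictest operating point whose
    FMR does not exceed target_fmr. A pair is accepted when its score
    is strictly greater than the threshold."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    # Number of impostor pairs we may accept without exceeding the target FMR.
    k = int(np.floor(target_fmr * len(impostor)))
    # With scores sorted high to low, at most k impostor scores lie
    # strictly above the element at index k.
    threshold = np.sort(impostor)[::-1][k]
    tmr = float(np.mean(genuine > threshold))
    return tmr, float(threshold)

if __name__ == "__main__":
    # Synthetic embeddings (assumption): genuine pairs are noisy copies
    # of the same vector, impostor pairs are unrelated random vectors.
    rng = np.random.default_rng(0)
    enrolled = rng.normal(size=(1000, 64))
    probes_same = enrolled + 0.5 * rng.normal(size=(1000, 64))
    probes_other = rng.normal(size=(1000, 64))

    genuine = cosine_scores(enrolled, probes_same)
    impostor = cosine_scores(enrolled, probes_other)
    tmr, thr = tmr_at_fmr(genuine, impostor, target_fmr=0.01)
    print(f"threshold={thr:.3f}  TMR@1%FMR={tmr:.4f}")
```

The threshold is set from the impostor scores alone, so the reported TMR is the fraction of genuine pairs that still pass once false matches are capped at the target rate.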
Abstract
The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for…
