Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?
Abu Sufian, Marco Leo, Cosimo Distante, Anirudha Ghosh, and Debaditya Barman

TL;DR
This paper explores whether combining Vision Transformers with ResNet's global features can improve fairness in biometric face authentication across diverse demographic groups, using novel datasets and a prototype network.
Contribution
It introduces a new empirical framework that integrates pre-trained ViT and ResNet features with a prototype network for fair demographic face authentication.
Findings
Microsoft's Swin Transformer outperformed the other ViT models evaluated.
Performance improves with larger support sets in few-shot scenarios.
The approach demonstrates potential for fairer biometric authentication.
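The few-shot authentication step can be illustrated with a prototypical-network-style classifier. What follows is a minimal plain-Python sketch under stated assumptions. Each identity's prototype is the mean of its support-set embeddings, and a query is matched to the nearest prototype by squared Euclidean distance, the distance used in the original prototypical networks. This summary does not give the paper's exact distance function or decision rule. The `threshold` rejection parameter and all function names are hypothetical. Because a prototype is an average over the support shots, adding shots tends to reduce noise in each prototype estimate. That is one plausible reading of why larger support sets help, not the authors' stated explanation.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def prototype(support: Sequence[Vector]) -> Vector:
    """Prototype of one identity: element-wise mean of its support embeddings."""
    dim = len(support[0])
    return [sum(vec[i] for vec in support) / len(support) for i in range(dim)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def authenticate(
    query: Vector,
    support_sets: Dict[str, List[Vector]],
    threshold: Optional[float] = None,  # hypothetical rejection threshold
) -> Optional[str]:
    """Match a query embedding to the identity with the nearest prototype.

    Returns None when `threshold` is set and even the nearest prototype is
    farther away than it (in squared distance), i.e. the query is rejected.
    """
    prototypes = {name: prototype(vecs) for name, vecs in support_sets.items()}
    best = min(prototypes, key=lambda name: squared_distance(query, prototypes[name]))
    if threshold is not None and squared_distance(query, prototypes[best]) > threshold:
        return None
    return best


# Toy 2-D embeddings standing in for backbone outputs (illustrative only).
support = {
    "alice": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
    "bob": [[-1.0, 0.2], [-0.8, -0.2]],
}
print(authenticate([0.8, 0.05], support))  # alice
```

Here the prototype for "alice" is roughly [1.0, 0.0] and the one for "bob" is roughly [-0.9, 0.0], so the query is matched to "alice".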
Abstract
Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a major challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features, as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone feature embeddings. We also developed new demographic face image support and query datasets for this empirical study. The…
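The backbone-fusion step described in the abstract can be sketched as a forward pass. The ViT and ResNet feature vectors are concatenated, then passed through two fully connected layers to produce an embedding. This is a minimal, hypothetical plain-Python illustration. The ReLU between the layers, the layer widths, the initialization, and the `FusionHead` name are all assumptions. The pre-trained backbones themselves are not included, and the weights are random rather than trained on face data. For scale, ResNet-18's pooled feature vector has 512 dimensions, while the ViT feature size depends on the chosen model. The toy dimensions below are for illustration only.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Sequence[float], weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer y = W x + b, with weight shaped (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Sequence[float]) -> Vector:
    """Element-wise ReLU (assumed activation between the two layers)."""
    return [max(0.0, v) for v in x]


def init_layer(in_dim: int, out_dim: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    """Uniform init scaled by 1/sqrt(in_dim); zero bias (an assumed choice)."""
    scale = (1.0 / in_dim) ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class FusionHead:
    """Hypothetical head: concat(ViT, ResNet) -> FC -> ReLU -> FC -> embedding."""

    def __init__(self, vit_dim: int, resnet_dim: int, hidden_dim: int,
                 embed_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # seeded so the sketch is reproducible
        self.vit_dim = vit_dim
        self.resnet_dim = resnet_dim
        self.fc1 = init_layer(vit_dim + resnet_dim, hidden_dim, rng)
        self.fc2 = init_layer(hidden_dim, embed_dim, rng)

    def __call__(self, vit_feat: Sequence[float], resnet_feat: Sequence[float]) -> Vector:
        if len(vit_feat) != self.vit_dim or len(resnet_feat) != self.resnet_dim:
            raise ValueError("feature sizes do not match the head's input dimensions")
        fused = list(vit_feat) + list(resnet_feat)  # feature concatenation
        hidden = relu(linear(fused, *self.fc1))     # first fully connected layer
        return linear(hidden, *self.fc2)            # second fully connected layer


# Toy sizes; real backbone features would be much larger.
head = FusionHead(vit_dim=8, resnet_dim=4, hidden_dim=6, embed_dim=3)
emb = head([0.1] * 8, [0.2] * 4)
print(len(emb))  # 3
```

Embeddings produced this way are what a prototype network would average per identity. How the paper actually trains the two layers, and whether the backbones are frozen, is not stated in this summary.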
