ViT-FIQA: Assessing Face Image Quality using Vision Transformers
Andrea Atzori, Fadi Boutros, Naser Damer

TL;DR
ViT-FIQA introduces a novel Vision Transformer-based method with a learnable quality token to accurately assess face image utility for recognition tasks, outperforming CNN-based approaches.
Contribution
This work extends Vision Transformer backbones with a learnable quality token for face image quality assessment, a task where ViT architectures remain underexplored, and reports better performance than existing CNN-based methods.
Findings
ViT-FIQA achieves top-tier results on benchmark datasets.
Transformer architecture effectively models face image utility.
The learnable quality token improves utility prediction accuracy.
Abstract
Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a…
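The token flow described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It uses a single attention layer with identity query/key/value projections and omits residual connections, layer normalization, MLP blocks, and all training losses. The function names, the toy dimensions, and the flattening of patch tokens before the fully connected layer are assumptions made for illustration. Because the abstract is cut off before the second head is described, the linear regression head on the quality token is also an assumption, based only on the stated goal of predicting a scalar utility score.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head global self-attention.

    Simplification: queries, keys, and values are the tokens themselves
    (identity projections), so only the mixing pattern is illustrated.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def linear(x, weight, bias):
    """Fully connected layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def forward(patch_tokens, quality_token, emb_w, emb_b, q_w, q_b):
    """Hypothetical forward pass: returns (face embedding, scalar quality score)."""
    # Concatenate the learnable quality token with the patch tokens.
    seq = [quality_token] + patch_tokens
    # Every token, including the quality token, attends to every other token.
    encoded = self_attention(seq)
    q_out, patch_out = encoded[0], encoded[1:]
    # Head 1: patch tokens -> fully connected layer -> face representation
    # (flattening the patch tokens first is an assumption).
    flat = [v for tok in patch_out for v in tok]
    embedding = linear(flat, emb_w, emb_b)
    # Head 2 (assumed): quality token -> linear regression -> scalar score.
    quality = linear(q_out, q_w, q_b)[0]
    return embedding, quality


# Toy dimensions and random weights, chosen only to make the sketch runnable.
random.seed(0)
d, n_patches, emb_dim = 4, 3, 5
patch_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_patches)]
quality_token = [random.gauss(0, 1) for _ in range(d)]
emb_w = [[random.gauss(0, 0.1) for _ in range(n_patches * d)] for _ in range(emb_dim)]
emb_b = [0.0] * emb_dim
q_w = [[random.gauss(0, 0.1) for _ in range(d)]]
q_b = [0.0]

embedding, quality = forward(patch_tokens, quality_token, emb_w, emb_b, q_w, q_b)
print(len(embedding), type(quality).__name__)
```

In this sketch the quality token is the only input that does not come from the image, so whatever it encodes at the output must have been gathered from the patch tokens through attention. That is the sense in which the token aggregates contextual information across all patches.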
