TL;DR
This paper investigates intermediate transformer-block representations in Vision Transformers (ViTs) for face image quality assessment and proposes a score fusion strategy that combines quality predictions from multiple blocks, improving on single-exit methods.
Contribution
The paper presents what it describes as the first systematic analysis of intermediate ViT features for face image quality assessment, and uses those features through a score fusion framework.
Findings
Different transformer depths capture complementary quality information.
Score fusion improves quality assessment over single-exit methods.
Depth-weighted averaging balances performance and computational efficiency.
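As a rough illustration of depth-weighted averaging, the sketch below fuses one image's per-block quality scores into a single score. The function name `depth_weighted_fusion` and the default weighting (weight proportional to block index, so deeper blocks count more) are assumptions made for this example. This summary does not state the paper's exact weighting scheme.

```python
from typing import Optional, Sequence


def depth_weighted_fusion(
    block_scores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Fuse per-block quality scores for one image into a single score.

    block_scores[i] is the quality score predicted from transformer block i+1.
    If no weights are given, block d gets weight d (an assumed scheme for
    illustration, not necessarily the one used in the paper).
    """
    num_blocks = len(block_scores)
    if num_blocks == 0:
        raise ValueError("at least one block score is required")

    if weights is None:
        # Assumption: linear depth weighting, 1 for the first block up to
        # num_blocks for the last one.
        weights = [float(depth) for depth in range(1, num_blocks + 1)]

    if len(weights) != num_blocks:
        raise ValueError("weights and block_scores must have the same length")

    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted mean of the per-block scores.
    weighted_sum = sum(w * s for w, s in zip(weights, block_scores))
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical scores from three blocks of one image.
    scores = [0.2, 0.5, 0.8]
    print(depth_weighted_fusion(scores))             # depth-weighted
    print(depth_weighted_fusion(scores, [1, 1, 1]))  # plain average
```

Passing only the scores from the first few blocks gives an early-exit variant that skips the remaining blocks at inference time, which is one way to trade accuracy against compute.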
Abstract
Face Image Quality Assessment is crucial for reliable face recognition systems, yet existing Vision Transformer-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional…
