Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation
Md Mahedi Hasan, Nasser Nasrabadi

TL;DR
This paper presents a novel framework called CGFR that leverages caption-guided contextual feature aggregation and refinement to enhance face recognition accuracy by effectively integrating textual descriptions with facial images.
Contribution
The paper introduces CFAM and TFRM modules to improve multi-modal feature fusion, addressing modality heterogeneity and enhancing textual feature discriminability in face recognition.
Findings
Significant performance improvements on the Multi-Modal CelebA-HQ dataset.
Enhanced 1:1 verification accuracy with caption guidance.
Improved 1:N identification results using the proposed framework.
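To make the two evaluation settings in the findings concrete, here is a minimal sketch of 1:1 verification and 1:N identification over feature vectors. It assumes cosine similarity as the matching score and a fixed threshold of 0.5. The function names, the toy gallery, and the threshold are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe, reference, threshold=0.5):
    """1:1 verification: do the probe and the reference share an identity?
    The threshold is an illustrative assumption."""
    return cosine(probe, reference) >= threshold

def identify(probe, gallery):
    """1:N identification: return the gallery identity most similar to the probe."""
    return max(gallery, key=lambda name: cosine(probe, gallery[name]))

# Hypothetical toy gallery of two enrolled identities with 2-D features.
gallery = {"id_a": [1.0, 0.0], "id_b": [0.0, 1.0]}
probe = [0.9, 0.1]

same_as_a = verify(probe, gallery["id_a"])
same_as_b = verify(probe, gallery["id_b"])
best_match = identify(probe, gallery)
```

In a caption-guided setting, the probe vector would be the fused image-caption feature rather than the image feature alone. The matching step itself would be unchanged.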
Abstract
We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition (FR) systems. In contrast to combining soft biometrics (e.g., facial marks, gender, and age) with face images, in this work we use facial descriptions provided by face examiners as auxiliary information. However, due to the heterogeneity of the modalities, improving performance by directly fusing the textual and facial features is very challenging, because the two lie in different embedding spaces. In this paper, we propose a contextual feature aggregation module (CFAM) that addresses this issue by effectively exploiting the fine-grained word-region interaction and global image-caption association. Specifically, CFAM adopts a self-attention and a cross-attention scheme for improving the intra-modality and inter-modality relationship…
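The self-attention and cross-attention scheme mentioned in the abstract can be illustrated with plain scaled dot-product attention. This is only a minimal sketch: it omits the learned query, key, and value projections, the multi-head split, and everything specific to CFAM. The toy word and region features are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.
    Each output row is a weighted average of the value vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical toy features: 3 word embeddings and 4 image-region embeddings.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

# Intra-modality: each word attends to all words in the caption.
self_attended = attention(words, words, words)
# Inter-modality: each word attends to the image regions.
cross_attended = attention(words, regions, regions)
```

Using words as queries and regions as keys and values gives each word a region-aware representation. Swapping the roles would give each region a caption-aware one.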
Taxonomy
Topics: Face recognition and analysis · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Linear Layer · Residual Connection · Dense Connections · Dropout · WordPiece · Attention Dropout
