Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation

Md Mahedi Hasan; Nasser Nasrabadi

arXiv:2308.06866·cs.CV·August 15, 2023

Open Access

TL;DR

This paper presents a novel framework called CGFR that leverages caption-guided contextual feature aggregation and refinement to enhance face recognition accuracy by effectively integrating textual descriptions with facial images.

Contribution

The paper introduces two modules, the contextual feature aggregation module (CFAM) and TFRM, to improve multi-modal feature fusion, addressing modality heterogeneity and enhancing the discriminability of textual features in face recognition.

Findings

01. Significant performance improvements on the Multi-Modal CelebA-HQ dataset.

02. Enhanced 1:1 verification accuracy with caption guidance.

03. Improved 1:N identification results using the proposed framework.
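
To make the two evaluation settings named in the findings concrete, the following is a minimal sketch of 1:1 verification and 1:N identification over feature vectors. It is not taken from the paper: the use of cosine similarity, the threshold value, the function names, and the toy gallery are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe, reference, threshold=0.5):
    """1:1 verification: do the two feature vectors belong to the same identity?

    The threshold is a hypothetical value, not one reported by the paper.
    """
    return cosine(probe, reference) >= threshold

def identify(probe, gallery):
    """1:N identification: return the gallery identity most similar to the probe."""
    return max(gallery, key=lambda name: cosine(probe, gallery[name]))

# Hypothetical toy gallery of enrolled identities and one probe feature.
gallery = {
    "id_a": [1.0, 0.0, 0.0],
    "id_b": [0.0, 1.0, 0.0],
    "id_c": [0.0, 0.0, 1.0],
}
probe = [0.9, 0.1, 0.0]

same_as_a = verify(probe, gallery["id_a"])
same_as_b = verify(probe, gallery["id_b"])
best_match = identify(probe, gallery)
```

In a caption-guided setting, the probe and gallery vectors would presumably be the fused image-and-text features rather than raw image features; the comparison step itself is unchanged in this sketch.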

Abstract

We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition (FR) systems. In contrast to combining soft biometrics (e.g., facial marks, gender, and age) with face images, in this work we use facial descriptions provided by face examiners as auxiliary information. However, because the modalities are heterogeneous and lie in different embedding spaces, improving performance by directly fusing the textual and facial features is very challenging. In this paper, we propose a contextual feature aggregation module (CFAM) that addresses this issue by effectively exploiting the fine-grained word-region interaction and global image-caption association. Specifically, CFAM adopts a self-attention and a cross-attention scheme for improving the intra-modality and inter-modality relationship…
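
The self-attention and cross-attention scheme mentioned in the abstract can be illustrated with a generic scaled dot-product attention sketch. This is not the authors' CFAM: it omits learned projections, multiple heads, and everything the truncated abstract does not describe, and the toy word and region features below are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(width).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Hypothetical toy features: 3 caption words and 2 image regions, width 4.
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
regions = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]

# Self-attention: words attend to other words (intra-modality).
intra = attention(words, words, words)
# Cross-attention: word features query image regions (inter-modality).
inter = attention(intra, regions, regions)
```

Each row of `inter` is a convex combination of the region features, weighted by how strongly the corresponding word attends to each region. That is one simple reading of "word-region interaction"; the paper's actual design may differ.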

Taxonomy

Topics: Face recognition and analysis · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Linear Layer · Residual Connection · Dense Connections · Dropout · WordPiece · Attention Dropout