CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation

Yuntao Shou; Wei Ai; Tao Meng; Nan Yin; Keqin Li

arXiv:2312.01758 · cs.CV · September 4, 2024 · 5 cites

TL;DR

This paper introduces CILF-CIAE, a multimodal approach that combines CLIP, a linear-complexity Transformer architecture, and an error-correcting inverse age estimation step to improve accuracy and efficiency in facial age prediction.

Contribution

The paper proposes a new CLIP-based fusion method with a linear complexity Transformer and an error feedback mechanism for more accurate age estimation.

Findings

01. Achieved superior age prediction accuracy on multiple datasets.

02. Reduced computational complexity compared to traditional attention mechanisms.

03. Enhanced multimodal feature fusion through contrastive learning.

Abstract

The age estimation task aims to predict the age of an individual by analyzing facial features in an image. The development of age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control). In recent years, contrastive language-image pre-training (CLIP) has been widely used in various multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods have high memory usage (quadratic complexity) when globally modeling images, and lack an error feedback mechanism to inform the model about the quality of its age predictions. To tackle these issues, we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic…
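The quadratic memory cost that the abstract attributes to global image modeling can be illustrated with a small NumPy sketch. It is not the Transformer proposed in the paper, whose details are not given in the text above. It contrasts standard softmax attention, which builds an n × n score matrix, with a generic kernelized linear attention that only keeps d × d_v intermediates. The feature map `elu(x) + 1` and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention: materializes an (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape (n, n): quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumption for this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Kernelized attention: never forms the (n, n) matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v               # shape (d, d_v), independent of n
    z = k.sum(axis=0)          # shape (d,), normalizer terms
    return (q @ kv) / (q @ z)[:, None]

def linear_attention_explicit(q, k, v):
    """Same result computed the quadratic way, used only as a check."""
    q, k = feature_map(q), feature_map(k)
    w = q @ k.T                # shape (n, n)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d, d_v = 16, 8, 4           # hypothetical sizes: tokens, key dim, value dim
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d_v))

out_linear = linear_attention(q, k, v)
out_check = linear_attention_explicit(q, k, v)
out_softmax = softmax_attention(q, k, v)
```

The two linear variants agree because matrix multiplication is associative: computing `k.T @ v` first avoids the n × n intermediate, so memory grows linearly with the number of image tokens. The softmax and kernelized outputs differ in general, since they use different similarity functions.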

Taxonomy

Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Dense Connections · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer