CLIP-HandID: Vision-Language Model for Hand-Based Person Identification
Nathanael L. Baisa, Babu Pallam, Amudhavel Jayavel

TL;DR
This paper presents CLIP-HandID, a vision-language approach that combines CLIP with learned pseudo-tokens to improve hand-based person identification. It targets criminal investigations in which hand images are often the only identifiable evidence.
Contribution
It introduces a method that uses CLIP and learned pseudo-tokens to extract discriminative features from hand images, improving identification accuracy over existing methods.
Findings
Significantly outperforms existing approaches on large hand datasets.
Effectively leverages multi-modal reasoning for better generalization.
Demonstrates robustness across multi-ethnic hand images.
Abstract
This paper introduces a novel approach to person identification using hand images, designed specifically for criminal investigations. The method is particularly valuable in serious crimes such as sexual abuse, where hand images are often the only identifiable evidence available. Our proposed method, CLIP-HandID, leverages a pre-trained foundational vision-language model, CLIP, to efficiently learn discriminative deep feature representations from hand images (input to CLIP's image encoder) using textual prompts as semantic guidance. Since hand images are labeled with indexes rather than text descriptions, we employ a textual inversion network to learn pseudo-tokens that encode specific visual contexts or appearance attributes. These learned pseudo-tokens are then incorporated into textual prompts, which are fed into CLIP's text encoder to leverage its multi-modal reasoning and enhance…
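The pseudo-token idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not call CLIP: the encoders are replaced by stand-ins (a single linear map for the textual inversion network, mean pooling for the text encoder), and the prompt template `a photo of a * hand`, the 8-dimensional embeddings, and all function names are assumptions made purely for illustration. It only shows the data flow: an image feature is mapped to a pseudo-token, the pseudo-token replaces a placeholder in the prompt, and the resulting text feature is compared with the image feature.

```python
import math
import random

DIM = 8  # toy embedding width (assumption); real CLIP embeddings are far wider


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


class ToyTextualInversion:
    """Hypothetical stand-in: one linear layer from image feature to pseudo-token."""

    def __init__(self, dim, rng):
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, image_feature):
        return [sum(w * x for w, x in zip(row, image_feature)) for row in self.weights]


def build_prompt_embeddings(template, vocab, pseudo_token, placeholder="*"):
    """Embed each word, substituting the pseudo-token at the placeholder position."""
    return [pseudo_token if word == placeholder else vocab[word]
            for word in template.split()]


def toy_text_encoder(token_embeddings):
    """Stand-in for a text encoder: mean-pool the tokens, then normalize."""
    count = len(token_embeddings)
    pooled = [sum(tok[i] for tok in token_embeddings) / count for i in range(DIM)]
    return l2_normalize(pooled)


rng = random.Random(0)
template = "a photo of a * hand"  # assumed template, not taken from the paper
vocab = {w: [rng.gauss(0.0, 1.0) for _ in range(DIM)]
         for w in template.split() if w != "*"}

inversion = ToyTextualInversion(DIM, rng)
# Random vector standing in for an image-encoder output
image_feature = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(DIM)])

pseudo_token = inversion(image_feature)                # image feature -> pseudo-token
prompt = build_prompt_embeddings(template, vocab, pseudo_token)
text_feature = toy_text_encoder(prompt)                # prompt -> text feature
similarity = cosine(image_feature, text_feature)       # image-text alignment score
print(f"image-text cosine similarity: {similarity:.3f}")
```

In a trained system the inversion weights would presumably be optimized so that this similarity is high for an image and its own prompt; here the weights are random, so the printed score carries no meaning beyond demonstrating the pipeline.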
