Human Re-ID Meets LVLMs: What can we expect?
Kailash Hambarde, Pranita Samale, Hugo Proença

TL;DR
This paper evaluates the performance of large vision-language models in human re-identification, comparing them to specialized AI models, revealing their strengths and limitations, and suggesting future hybrid approaches.
Contribution
It provides a comprehensive comparison of LVLMs and specialized ReID models on the Market1501 dataset, highlighting their capabilities and shortcomings.
Findings
LVLMs show strengths in certain metrics.
LVLMs have severe limitations leading to errors.
Hybrid approaches could improve performance.
Abstract
Large vision-language models (LVLMs) have been regarded as a breakthrough across an astonishingly wide variety of tasks, from content generation to virtual assistants and multimodal search or retrieval. However, in many of these applications their performance has been widely criticized, particularly when compared with state-of-the-art methods and technologies in each specific domain. In this work, we compare the performance of the leading LVLMs on the human re-identification (ReID) task, using as a baseline the performance attained by state-of-the-art AI models specifically designed for this problem. We compare the results obtained by ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max with a PersonViT ReID baseline, using the well-known Market1501 dataset. Our evaluation pipeline includes dataset curation, prompt engineering, and metric…
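The abstract is cut off before it names the evaluation metrics. ReID results on Market1501 are conventionally reported as CMC rank-k accuracy and mean average precision (mAP). Assuming the paper follows that convention (the excerpt does not confirm it), the sketch below computes both from a query-by-gallery distance matrix in plain Python. The function `evaluate_reid` and its parameters are hypothetical names chosen for illustration. The sketch drops gallery images that share both identity and camera with the query, as the usual Market1501 protocol does, but it omits the protocol's separate "junk" image handling.

```python
from typing import List, Sequence, Tuple


def evaluate_reid(
    dist: Sequence[Sequence[float]],
    q_ids: Sequence[int],
    g_ids: Sequence[int],
    q_cams: Sequence[int],
    g_cams: Sequence[int],
    max_rank: int = 5,
) -> Tuple[List[float], float]:
    """Return (CMC accuracy for ranks 1..max_rank, mAP).

    dist[i][j] is the distance between query i and gallery image j
    (lower means more similar). Hypothetical helper, not the paper's code.
    """
    cmc_sum = [0.0] * max_rank
    ap_sum = 0.0
    n_valid = 0
    for i, row in enumerate(dist):
        # Rank all gallery images by ascending distance to query i.
        order = sorted(range(len(row)), key=lambda j: row[j])
        # Assumed Market1501 convention: ignore gallery images with the
        # same identity AND the same camera as the query.
        kept = [j for j in order
                if not (g_ids[j] == q_ids[i] and g_cams[j] == q_cams[i])]
        matches = [g_ids[j] == q_ids[i] for j in kept]
        if not any(matches):
            continue  # query has no valid ground-truth match; skip it
        n_valid += 1
        # CMC: the query counts as a hit at every rank >= its first true match.
        first_hit = matches.index(True)
        for k in range(first_hit, max_rank):
            cmc_sum[k] += 1.0
        # AP: mean of precision@k taken at the rank k of each true match.
        hits, precisions = 0, []
        for k, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / k)
        ap_sum += sum(precisions) / len(precisions)
    if n_valid == 0:
        raise ValueError("no query has a valid match in the gallery")
    return [c / n_valid for c in cmc_sum], ap_sum / n_valid


if __name__ == "__main__":
    # Toy example: 2 queries, 3 gallery images, all from different cameras.
    cmc, mean_ap = evaluate_reid(
        dist=[[0.1, 0.5, 0.9], [0.1, 0.5, 0.3]],
        q_ids=[1, 2], g_ids=[1, 2, 2],
        q_cams=[0, 0], g_cams=[1, 1, 2],
        max_rank=3,
    )
    print(cmc, mean_ap)
```

In the toy run, the first query retrieves its match at rank 1. The second query's nearest gallery image has the wrong identity, so it only counts as a hit from rank 2 onward. That gives rank-1 accuracy of 0.5 and an mAP of about 0.79. For an LVLM that answers through prompts rather than producing distances, the ranking would have to be derived from its responses, and how the paper does this is not stated in the excerpt.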
