# Revisiting Text-Based Person Retrieval: Mitigating Annotation-Induced Mismatches with Multimodal Large Language Models

**Authors:** Zihang Han, Chao Zhu, Mengyin Liu

PMC · DOI: 10.3390/s26051599 · Sensors (Basel, Switzerland) · 2026-03-04

## TL;DR

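Text-based person retrieval (TBPR) benchmarks often give near-identical descriptions to images of different identities, so correct retrievals can be scored as mismatches. This paper uses multimodal large language models (MLLMs) to rewrite those descriptions so that each one is distinctive.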

## Contribution

An annotation refinement framework that uses MLLMs to process mismatch-prone images together, generate a distinctive description for each, and thereby improve the annotation quality of TBPR benchmarks.

## Key findings

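- Annotation-induced mismatches arise when images of persons with different identities carry very similar or identical descriptions. The authors attribute this mainly to each image being annotated individually, without reference to other similar images.
- The proposed framework improves annotation quality, and the resulting more discriminative captions benefit mainstream TBPR models.
- Experiments on three benchmarks (CUHK-PEDES, RSTPReid and ICFG-PEDES) validate the effectiveness of the method in improving annotation quality.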
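
The first finding can be illustrated with a toy evaluation. The sketch below is not from the paper: it assumes Rank-1 accuracy as the metric (this summary does not say which metric the authors report), and the similarity scores, identities, and the function name `rank1_accuracy` are all hypothetical. It shows how two images of different identities that share one description force at least one query to be counted as a miss, even though the retrieved image fits the text.

```python
def rank1_accuracy(similarity, query_ids, gallery_ids):
    """Fraction of text queries whose top-scoring gallery image shares the query's identity."""
    hits = 0
    for q, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda g: row[g])  # index of the top-ranked image
        hits += gallery_ids[best] == query_ids[q]
    return hits / len(similarity)


# Hypothetical toy data: two look-alike gallery images with different identities.
gallery_ids = [0, 1]
query_ids = [0, 1]  # caption q was written for the image with identity query_ids[q]

# Ambiguous annotations: both captions read the same, so both rows score the
# two images identically and the second query's top hit has the "wrong" identity.
ambiguous = [[0.90, 0.89],
             [0.90, 0.89]]

# Distinctive annotations: each caption now scores its own image clearly higher.
distinctive = [[0.90, 0.40],
               [0.35, 0.88]]

ambiguous_r1 = rank1_accuracy(ambiguous, query_ids, gallery_ids)
distinctive_r1 = rank1_accuracy(distinctive, query_ids, gallery_ids)
print(ambiguous_r1, distinctive_r1)
```

With the ambiguous captions, the model's top result for the second query matches the text but belongs to the other identity, so the metric penalizes a retrieval that is arguably correct.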

## Abstract

Text-based person retrieval (TBPR) aims to search for target person images from large-scale video clips or image databases based on textual descriptions. Benchmark quality is critical for accurately evaluating the cross-modal matching ability of TBPR models. However, we find that existing TBPR benchmarks share a common problem: multiple images of persons with different identities often have very similar or even identical textual descriptions. As a consequence, even when a TBPR model retrieves images that correctly correspond to a given description, such matches may be erroneously evaluated as mismatches. We argue that the main cause of this problem is that each person image is annotated individually without reference to other similar images, making it challenging to provide a distinctive description for each image. To address this problem, we propose an effective and efficient annotation refinement framework that improves the annotation quality of TBPR benchmarks and thereby mitigates annotation-induced mismatches. First, sets of images prone to mismatches are automatically identified by TBPR models. Then, multimodal large language models (MLLMs) process the images in each set simultaneously and generate a distinctive description for each image. Finally, the original descriptions are replaced with the generated ones. Extensive experiments on three popular TBPR benchmarks (CUHK-PEDES, RSTPReid and ICFG-PEDES) validate the effectiveness of the proposed method in improving annotation quality and demonstrate that the resulting more discriminative captions benefit mainstream TBPR models. The improved annotations of these benchmarks will be released publicly.
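
The three steps in the abstract (identify mismatch-prone sets, describe the images jointly, replace the original descriptions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the selection criterion, so the rule used here (flag a caption when its top-ranked image has a different identity than the image it annotates) is an assumption, and `find_mismatch_prone_sets`, `refine_annotations`, `describe_jointly`, and `fake_mllm` are hypothetical names. The MLLM is replaced by a stub so the sketch runs without any model.

```python
def find_mismatch_prone_sets(similarity, caption_to_image, image_identity, top_k=2):
    """Step 1 (assumed criterion): collect image sets where a caption's top-1
    image has a different identity than the caption's annotated image."""
    sets = set()
    for c, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:top_k]
        target = caption_to_image[c]
        if image_identity[ranked[0]] != image_identity[target]:
            sets.add(tuple(sorted(set(ranked) | {target})))
    return sorted(sets)


def refine_annotations(captions, similarity, caption_to_image, image_identity,
                       describe_jointly, top_k=2):
    """Steps 2 and 3: describe each flagged set jointly, then replace the
    captions of every image in that set."""
    refined = list(captions)
    for image_set in find_mismatch_prone_sets(similarity, caption_to_image,
                                              image_identity, top_k):
        new_text = describe_jointly(image_set)  # maps image index -> description
        for c, img in enumerate(caption_to_image):
            if img in new_text:
                refined[c] = new_text[img]
    return refined


def fake_mllm(image_set):
    """Stand-in for an MLLM that sees all images in the set at once (hypothetical)."""
    return {i: f"distinctive description of image {i}" for i in image_set}


# Hypothetical toy data: captions 0 and 1 are identical but annotate different identities.
captions = ["a man in a black jacket", "a man in a black jacket", "a woman in a red dress"]
caption_to_image = [0, 1, 2]
image_identity = [10, 11, 12]
similarity = [[0.90, 0.89, 0.10],
              [0.90, 0.89, 0.10],
              [0.10, 0.20, 0.95]]

refined = refine_annotations(captions, similarity, caption_to_image,
                             image_identity, fake_mllm)
print(refined)
```

Passing the describer in as a callable keeps the selection and replacement logic independent of any particular MLLM; in practice the stub would be swapped for a call that sends all images in the set to the model in one prompt.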

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12987367/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12987367/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12987367/full.md

---
Source: https://tomesphere.com/paper/PMC12987367