ReText: Text Boosts Generalization in Image-Based Person Re-identification
Timur Mamedov, Karina Kvanchiani, Anton Konushin, Vadim Konushin

TL;DR
ReText introduces a multimodal training approach combining multi-camera and text-annotated single-camera data to enhance generalization in image-based person re-identification across unseen domains.
Contribution
It is the first to explore multimodal joint learning with text and images in person Re-ID, improving cross-domain generalization without complex architectures.
Findings
ReText outperforms state-of-the-art methods on cross-domain Re-ID benchmarks.
Joint training on image and text data enhances semantic understanding.
The method achieves strong generalization in unseen environments.
Abstract
Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly…
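The three jointly optimized tasks named in the abstract can be pictured as a weighted sum of per-task losses. The sketch below is only an illustration under stated assumptions: the abstract does not specify the loss functions or their weights, so cross-entropy (Re-ID), a symmetric contrastive loss (image-text matching), and mean squared error (reconstruction) are stand-ins, and every function and parameter name (`joint_loss`, `w_match`, `w_recon`, `temperature`) is hypothetical. The text guidance for reconstruction is assumed to happen inside the model that produces `recon`, which is not modeled here.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def cross_entropy(logits, targets):
    # Mean cross-entropy over a batch of logit rows.
    return -sum(log_softmax(r)[t] for r, t in zip(logits, targets)) / len(logits)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / n for x in v]

def matching_loss(img_emb, txt_emb, temperature=0.07):
    # Assumed stand-in: symmetric contrastive loss in which the i-th image
    # should match the i-th text description.
    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    sim = [[dot(i, t) / temperature for t in txt] for i in img]
    sim_t = [list(col) for col in zip(*sim)]
    targets = list(range(len(img)))
    return 0.5 * (cross_entropy(sim, targets) + cross_entropy(sim_t, targets))

def reconstruction_loss(recon, pixels):
    # Assumed stand-in: mean squared error over all pixel values.
    sq = [(p - t) ** 2 for pr, tr in zip(recon, pixels) for p, t in zip(pr, tr)]
    return sum(sq) / len(sq)

def joint_loss(reid_logits, reid_ids, img_emb, txt_emb, recon, pixels,
               w_match=1.0, w_recon=1.0):
    # (1) Re-ID on the multi-camera batch; (2) and (3) on the single-camera batch.
    l_reid = cross_entropy(reid_logits, reid_ids)
    l_match = matching_loss(img_emb, txt_emb)
    l_recon = reconstruction_loss(recon, pixels)
    return l_reid + w_match * l_match + w_recon * l_recon
```

In this sketch the Re-ID term sees only identity-labelled multi-camera samples, while the matching and reconstruction terms see only the text-annotated single-camera samples, mirroring the split described in the abstract.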
Peer Reviews
Decision: Submitted to ICLR 2026
The description of the proposed method is easy to follow.
1. Lack of comparison with some important SOTA methods. (1) Given that the proposed method is trained on a large-scale dataset (see Tab. 1, more than 4.7M images), recent large-scale pretraining methods for ReID should be included in the comparison. For example, PLIP [1], which uses only SYNTH-PEDES but shows better performance on Market. (2) Besides, this paper is also related to multimodal multitask training; therefore, Instruct-ReID [2] should also be included, which also achieves great perfor
1. The figures and tables of this paper are relatively clear, and its writing is easy to understand. 2. It uses mixed training of multi-camera Re-ID data and text-annotated single-camera data, addressing both the scarcity of multi-camera data and the lack of cross-view variation in single-camera data. 3. ReText achieves better experimental results on multiple datasets compared with multiple existing methods.
1. Multi-task joint learning is not a novel concept in fields such as person re-identification or person retrieval. Many works employ methods based on generation/reconstruction and image-text matching for joint learning with ReID. The methods used in each task in this paper are mostly the introduction or adaptation of existing technologies; thus, the improvements in framework design are incremental. 2. The authors note in the implementation details that ReText is trained on a hybrid dataset con
[+] GOOD performance.
[-] The paper offers almost no novelty. The losses and modules employed are off-the-shelf, and the dataset used is already publicly available. [-] The motivation is somewhat confused. As the summary states, the paper argues that “single-camera data is less complex due to limited cross-view variation.” But why does the lack of cross-view necessarily imply less complexity? Take LUPerson, for example—a dataset collected from numerous videos across diverse scenes. Surely, that is complex data. I agr
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition
