TL;DR
This paper introduces SAGA-ReID, a novel method for person re-identification that reconstructs identity features by aligning patch tokens with text-anchored vectors, improving robustness under occlusion and cross-camera variation.
Contribution
SAGA-ReID emphasizes spatially stable evidence by aligning patch tokens with CLIP's text embedding space, outperforming global pooling especially under occlusion.
Findings
SAGA-ReID improves Rank-1 accuracy by up to 10.6 points on occluded benchmarks.
Its advantage over global pooling grows as occlusion increases.
Structured reconstruction addresses limitations of backbone quality and architecture.
Abstract
CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space -- emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions -- synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal -- with SAGA's advantage over global pooling growing…
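Illustrative sketch (not the authors' implementation)
The abstract contrasts a single global representation with one reconstructed from patch tokens that align with anchor vectors. The toy NumPy example below shows that general idea only. It makes several assumptions that are not in the paper: anchors are fixed hand-written vectors rather than learned parameters in CLIP's text embedding space, mean pooling stands in for the global [CLS] token, and each patch is weighted by a softmax over its best anchor similarity. The function names and the temperature value are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length; eps avoids division by zero."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def global_mean_pool(patches):
    """Baseline stand-in for a global token: every patch counts equally."""
    return patches.mean(axis=0)


def anchor_weighted_pool(patches, anchors, temperature=0.1):
    """Weight each patch by how well it matches its closest anchor.

    patches: (N, D) array of patch features
    anchors: (K, D) array of anchor vectors (hypothetical, fixed here)
    Returns the pooled (D,) feature and the (N,) patch weights.
    """
    sim = l2_normalize(patches) @ l2_normalize(anchors).T  # (N, K) cosine similarity
    score = sim.max(axis=1)                                 # best anchor per patch
    logits = score / temperature
    logits = logits - logits.max()                          # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ patches, weights


# Toy input: 4 visible patches aligned with the anchor, 4 masked (all-zero) patches.
anchors = np.array([[1.0, 0.0, 0.0, 0.0]])
visible = np.tile(anchors[0], (4, 1))
masked = np.zeros((4, 4))
patches = np.vstack([visible, masked])

baseline = global_mean_pool(patches)
pooled, weights = anchor_weighted_pool(patches, anchors)

print("mean-pooled feature:    ", baseline)
print("anchor-weighted feature:", np.round(pooled, 4))
print("weight on visible patches:", round(float(weights[:4].sum()), 4))
```

In this toy setup the masked patches dilute the mean-pooled feature, while the anchor-weighted feature stays close to the visible evidence. It illustrates why similarity-based reweighting can be less sensitive to absent regions; it does not reproduce the paper's results or architecture.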
