TL;DR
This paper introduces PFCVR, a part-level fine-grained cross-modal vehicle retrieval model for text-to-image vehicle re-identification, along with T2I-VeRW, a new large-scale dataset for the task.
Contribution
The paper proposes a new model with local part-level alignment and a bi-directional mask recovery module, and constructs a large-scale dataset for text-to-image vehicle re-identification.
Findings
PFCVR achieves 29.2% Rank-1 accuracy on T2I-VeRI, surpassing previous methods.
On T2I-VeRW, PFCVR attains 55.2% Rank-1 accuracy, outperforming recent state-of-the-art models.
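For readers unfamiliar with the metric, Rank-1 accuracy in text-to-image retrieval is the fraction of text queries whose top-ranked gallery image shows the same vehicle identity as the query. The following is a minimal sketch of that computation, not the paper's evaluation code: the function name `rank_k_accuracy`, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def rank_k_accuracy(text_emb, image_emb, query_ids, gallery_ids, k=1):
    """Fraction of text queries with a matching identity in the top-k images.

    Hypothetical helper for illustration; assumes cosine similarity between
    text and image embeddings, which the source does not specify.
    """
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = np.asarray(image_emb, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)

    # L2-normalize so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    # Similarity matrix: one row per text query, one column per gallery image.
    sim = t @ g.T

    # Indices of the k most similar gallery images for each query.
    topk = np.argsort(-sim, axis=1)[:, :k]

    # A query counts as a hit if any of its top-k images shares its identity.
    hits = (gallery_ids[topk] == query_ids[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (made up): three text queries and three gallery images in 2-D.
toy_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_images = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]
toy_ids = [0, 1, 2]

print(rank_k_accuracy(toy_text, toy_images, toy_ids, toy_ids, k=1))
```

In this toy example the third query is closest to images of other identities, so it counts as a miss at Rank-1. Raising `k` relaxes the criterion, which is why Rank-5 and Rank-10 figures are always at least as high as Rank-1.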
Abstract
Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part-query tokens that aggregate both part-specific and full-sentence context before aligning with visual part features. On top of this explicit local alignment, a bi-directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature…
