OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval
Yunhao Du, Binyu Zhang, Xiangning Ruan, Fei Su, Zhicheng Zhao, Hong Chen

TL;DR
This paper introduces OMG, a novel framework for vehicle retrieval using natural language, which leverages multiple granularities in visual and textual representations and employs a multi-granularity contrastive loss, significantly improving retrieval accuracy.
Contribution
The paper proposes a multi-granularity approach for vehicle retrieval that fully exploits different levels of visual and textual information, enhancing cross-modal matching performance.
Findings
Outperforms previous methods significantly
Ranks 9th on AI City Challenge Track2
Effective multi-granularity contrastive loss
Abstract
Retrieving tracked-vehicles by natural language descriptions plays a critical role in smart city construction. It aims to find the best match for the given texts from a set of tracked vehicles in surveillance videos. Existing works generally solve it by a dual-stream framework, which consists of a text encoder, a visual encoder and a cross-modal loss function. Although some progress has been made, they failed to fully exploit the information at various levels of granularity. To tackle this issue, we propose a novel framework for the natural language-based vehicle retrieval task, OMG, which Observes Multiple Granularities with respect to visual representation, textual representation and objective functions. For the visual representation, target features, context features and motion features are encoded separately. For the textual representation, one global embedding, three local…
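The dual-stream setup described in the abstract (a text encoder, a visual encoder, and a cross-modal loss) can be illustrated with a minimal sketch. This is not the paper's multi-granularity loss: it is a generic symmetric InfoNCE contrastive loss over one embedding per modality, and it assumes the two encoders have already produced fixed-size vectors. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(text_emb, visual_emb, temperature=0.07):
    # Assumption: row i of text_emb describes the vehicle track in row i of
    # visual_emb, so matching pairs lie on the diagonal of the similarity matrix.
    t = l2_normalize(text_emb)
    v = l2_normalize(visual_emb)
    logits = t @ v.T / temperature
    idx = np.arange(len(t))
    loss_t2v = -log_softmax(logits)[idx, idx].mean()    # text -> vehicle
    loss_v2t = -log_softmax(logits.T)[idx, idx].mean()  # vehicle -> text
    return 0.5 * (loss_t2v + loss_v2t)

def retrieve(text_emb, visual_emb):
    # For each text query, return track indices sorted from best to worst match.
    sims = l2_normalize(text_emb) @ l2_normalize(visual_emb).T
    return np.argsort(-sims, axis=1)
```

At inference time, only `retrieve` would be needed: each query text is ranked against all candidate tracks by cosine similarity. A multi-granularity variant would presumably combine several such terms over different embedding pairs, but the abstract as shown here does not specify how.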
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Mobility and Location-Based Analysis · Video Surveillance and Tracking Methods
