OVMR: Open-Vocabulary Recognition with Multi-Modal References

Zehong Ma; Shiliang Zhang; Longhui Wei; Qi Tian

arXiv:2406.04675 · cs.CV · June 10, 2024

TL;DR

This paper introduces OVMR, a multi-modal approach for open-vocabulary recognition that combines textual descriptions and exemplar images to improve robustness and generalization without fine-tuning.

Contribution

The paper proposes a novel multi-modal classification framework with a preference-based refinement module for open-vocabulary recognition, avoiding fine-tuning and handling low-quality data.

Findings

1. Outperforms existing methods across various scenarios.
2. Effectively integrates textual and visual cues for recognition.
3. Works well with Internet-crawled exemplar images.

Abstract

The challenge of open-vocabulary recognition is that the model has no clue about the new categories it is applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning or by providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions can be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method, named OVMR, adopts two innovative components to pursue a more robust embedding of category cues. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is then applied to…
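To make the idea of a classifier built from both a textual description and exemplar images more concrete, here is a minimal sketch. It is not the OVMR method: the abstract says the multi-modal classifier is generated dynamically, whereas this sketch simply blends a text embedding with the mean exemplar embedding using a fixed weight. The function names, the `alpha` parameter, and the 3-dimensional toy embeddings are all assumptions made for illustration; in practice the embeddings would come from a Vision-Language Model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; reject zero vectors."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_classifier(text_emb, exemplar_embs, alpha=0.5):
    """Blend a text embedding with the mean exemplar embedding.

    alpha is a hypothetical fixed mixing weight (an assumption of this
    sketch, not something taken from the paper).
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(mean_vector([l2_normalize(e) for e in exemplar_embs]))
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(t, v)]
    return l2_normalize(fused)


def classify(query_emb, classifiers):
    """Return the best-matching category and all cosine scores."""
    q = l2_normalize(query_emb)
    scores = {
        name: sum(a * b for a, b in zip(q, w))
        for name, w in classifiers.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for real text and image features.
classifiers = {
    "cat": build_classifier([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]),
    "dog": build_classifier([0.0, 1.0, 0.0], [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]),
}
label, scores = classify([0.7, 0.2, 0.1], classifiers)
```

No model weights are updated here: adding a new category only requires computing one more fused vector, which matches the abstract's motivation of avoiding fine-tuning. A fixed `alpha` cannot react to a poor description or a poor set of exemplars, which is the kind of issue a refinement step would have to address.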

Code & Models

Repositories

zehong-ma/ovmr (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling