OVMR: Open-Vocabulary Recognition with Multi-Modal References
Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian

TL;DR
This paper introduces OVMR, a multi-modal approach for open-vocabulary recognition that combines textual descriptions and exemplar images to improve robustness and generalization without fine-tuning.
Contribution
The paper proposes a novel multi-modal classification framework with a preference-based refinement module for open-vocabulary recognition, avoiding fine-tuning and handling low-quality data.
Findings
Outperforms existing methods across various scenarios.
Effectively integrates textual and visual cues for recognition.
Works well with Internet-crawled exemplar images.
Abstract
The challenge of open-vocabulary recognition is that the model has no clue about the new categories it is applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning or by providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions can be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method, named OVMR, adopts two innovative components to pursue a more robust embedding of category cues. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is hence applied to…
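The idea of complementing a textual description with exemplar images can be illustrated with a minimal sketch. This is not the OVMR method: the abstract describes a dynamic, learned fusion followed by a preference-based refinement, whereas the sketch below simply averages a text embedding with the mean exemplar embedding using a fixed weight. The function names, the `alpha` parameter, and the toy 3-dimensional embeddings are all assumptions made for illustration; in practice the embeddings would come from a vision-language encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one exemplar embedding is required")
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def fuse_prototype(text_emb, exemplar_embs, alpha=0.5):
    """Build one class prototype from a text embedding and exemplar embeddings.

    alpha is a hypothetical fixed mixing weight; OVMR instead complements
    the two modalities dynamically, which this sketch does not attempt.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(mean_vector([l2_normalize(e) for e in exemplar_embs]))
    return l2_normalize([alpha * a + (1 - alpha) * b for a, b in zip(t, v)])

def classify(query_emb, prototypes):
    """Return the best-matching class name and all cosine-similarity scores."""
    q = l2_normalize(query_emb)
    scores = {
        name: sum(a * b for a, b in zip(q, proto))
        for name, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for encoder outputs (assumed values).
prototypes = {
    "cat": fuse_prototype([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]),
    "dog": fuse_prototype([0.0, 1.0, 0.0], [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]),
}
label, scores = classify([0.7, 0.2, 0.1], prototypes)
```

Because the prototypes are built at inference time from references alone, adding a new category only requires a description and a few exemplars, with no fine-tuning step, which matches the setting the abstract describes.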
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
