Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu, Xiaohui Shen, Liang-Chieh Chen

TL;DR
This paper introduces the OmniScient Model, a large language model-based mask classifier that predicts class labels generatively, eliminating the need for predefined class names during training and testing, and enabling robust open-ended visual recognition.
Contribution
The OmniScient Model is a novel LLM-based mask classifier that predicts class labels generatively, allowing open-ended recognition without human intervention or predefined class sets.
Findings
Achieves promising results on various benchmarks.
Effectively handles novel concepts in visual recognition.
Enables cross-dataset training without human interference.
Abstract
Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to address the issue by employing a class-agnostic mask (or box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP) using pre-extracted text embeddings. However, it is worth noting that these open-vocabulary recognition models still exhibit limitations in practical applications. On one hand, they rely on the provision of class names during testing, where the recognition performance heavily depends on this predefined set of semantic classes by users. On the other hand, when training with multiple datasets, human intervention is required to alleviate the label definition conflict between them. In this paper, we introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Handwritten Text Recognition Techniques
MethodsSparse Evolutionary Training
