OLIVE: Object Level In-Context Visual Embeddings
Timothy Ossowski, Junjie Hu

TL;DR
OLIVE introduces object-level visual embeddings that improve fine-grained understanding, enable controllable reasoning, and enhance zero-shot generalization in vision-language models, surpassing traditional patch-based approaches.
Contribution
The paper presents a novel object-level embedding method for vision-language models, enabling faster training, better object grounding, and zero-shot adaptation without additional training.
Findings
Achieves competitive object classification and captioning performance.
Enables zero-shot generalization to unseen objects.
Provides robustness in challenging visual contexts.
Abstract
Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-level understanding and grounding. In terms of modeling, existing VLMs implicitly align text tokens with image patch tokens, which is ineffective for embedding alignment at the same granularity and inevitably introduces noisy spurious background features. Additionally, these models struggle when generalizing to unseen visual concepts and may not be reliable for domain-specific tasks without further fine-tuning. To address these limitations, we propose a novel method to prompt large language models with in-context visual object vectors, thereby enabling controllable object-level reasoning. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up…
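To make the contrast between patch-level and object-level inputs concrete, here is a minimal sketch. It assumes, purely for illustration, that an object vector is formed by averaging the patch features that fall inside an object's region; the abstract does not state how OLIVE actually builds its object vectors. The function `pool_object_vector`, the mask names, and the toy feature values are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def pool_object_vector(patch_features: List[Vector], mask: List[bool]) -> Vector:
    """Mean-pool the patch features selected by a boolean mask into one vector.

    Assumption for illustration only: the source does not specify this pooling.
    """
    selected = [feat for feat, keep in zip(patch_features, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(feat[d] for feat in selected) / len(selected) for d in range(dim)]


# Toy image: a 2x2 patch grid flattened to 4 patches, each with a 2-dim feature.
patch_features: List[Vector] = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [0.0, 8.0]]

# Hypothetical object regions, one boolean per patch.
object_masks: Dict[str, List[bool]] = {
    "object_a": [True, True, False, False],
    "object_b": [False, False, True, True],
}

# One vector per object, instead of one token per patch.
object_vectors: Dict[str, Vector] = {
    name: pool_object_vector(patch_features, mask)
    for name, mask in object_masks.items()
}

print(len(patch_features), "patch tokens vs", len(object_vectors), "object vectors")
```

In this toy setup the language model would receive two object vectors rather than four patch tokens. With a realistic patch grid the gap is far larger, which is consistent with the abstract's point about avoiding a lengthy array of patch features.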
Taxonomy
Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies
Methods: ALIGN
