OLIVE: Object Level In-Context Visual Embeddings

Timothy Ossowski; Junjie Hu

arXiv:2406.00872 · cs.CV · June 4, 2024

TL;DR

OLIVE introduces object-level visual embeddings that improve fine-grained understanding, enable controllable reasoning, and enhance zero-shot generalization in vision-language models, surpassing traditional patch-based approaches.

Contribution

The paper presents a novel object-level embedding method for vision-language models, enabling faster training, better object grounding, and zero-shot adaptation without additional training.

Findings

1. Achieves competitive object classification and captioning performance.
2. Enables zero-shot generalization to unseen objects.
3. Provides robustness in challenging visual contexts.

Abstract

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-level understanding and grounding. In terms of modeling, existing VLMs implicitly align text tokens with image patch tokens, which is ineffective for embedding alignment at the same granularity and inevitably introduces noisy spurious background features. Additionally, these models struggle when generalizing to unseen visual concepts and may not be reliable for domain-specific tasks without further fine-tuning. To address these limitations, we propose a novel method to prompt large language models with in-context visual object vectors, thereby enabling controllable object-level reasoning. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up…
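To make the contrast between a long array of patch features and a single object-level vector concrete, here is a minimal sketch. It assumes one simple way of forming an object vector: averaging the features of the patches that fall inside an object region. The abstract does not say how OLIVE computes its object vectors, so this pooling rule, the function name `pool_object_vector`, and the toy numbers are illustrative assumptions, not the authors' method.

```python
from typing import List

Vector = List[float]


def pool_object_vector(patch_features: List[Vector], object_mask: List[bool]) -> Vector:
    """Average the features of the patches selected by the object mask.

    Hypothetical helper: masked average pooling is an assumed stand-in,
    not the pooling rule described in the paper.
    """
    if len(patch_features) != len(object_mask):
        raise ValueError("one mask entry is required per patch")
    selected = [feat for feat, keep in zip(patch_features, object_mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(feat[i] for feat in selected) / len(selected) for i in range(dim)]


# Toy 2x2 grid of patches with 3-dimensional features (made-up numbers).
patches = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [5.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # background patch, excluded by the mask
]
mask = [True, True, True, False]

# One vector stands in for the object, instead of four patch vectors.
object_vector = pool_object_vector(patches, mask)
```

Under this assumption, the language model would receive one vector per object rather than one per patch, and background patches outside the mask contribute nothing to it.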

Code & Models

Repositories

tossowski/OLIVE (PyTorch, official)

Taxonomy

Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies

Methods: ALIGN