ObjEmbed: Towards Universal Multimodal Object Embeddings

Shenghao Fu; Yukun Su; Fengyun Rao; Jing Lyu; Xiaohua Xie; Wei-Shi Zheng

arXiv:2602.01753·cs.CV·February 4, 2026

TL;DR

ObjEmbed is a multimodal embedding model that decomposes an image into per-object regional embeddings, capturing both semantic and spatial information, alongside global embeddings. This supports vision-language tasks at both the region level and the image level.

Contribution

It presents a new object-oriented multimodal embedding approach that captures semantic and spatial information, supporting diverse visual tasks with high efficiency and superior benchmark performance.

Findings

01. Outperforms existing models on 18 benchmarks.

02. Effectively handles both region-level and image-level tasks.

03. Provides accurate object localization and retrieval.

Abstract

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU…
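To make the idea of matching phrases against per-object embeddings concrete, here is a minimal sketch, not the authors' implementation. It assumes each image has already been turned into a list of object embeddings and that the text query lives in the same vector space; the function names (`cosine`, `ground_phrase`, `rank_images_by_region`) and the use of cosine similarity are illustrative assumptions. The IoU embedding mentioned in the abstract is deliberately left out, since the abstract text is cut off before describing how it is used.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_phrase(text_emb: Vector, object_embs: List[Vector]) -> Tuple[int, float]:
    """Hypothetical grounding step: return (index, score) of the object
    embedding most similar to the text embedding."""
    if not object_embs:
        raise ValueError("image has no object embeddings")
    scores = [cosine(text_emb, obj) for obj in object_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def rank_images_by_region(text_emb: Vector, images: List[List[Vector]]) -> List[int]:
    """Hypothetical local retrieval: score each image by its best-matching
    object and return image indices from best to worst."""
    image_scores = [ground_phrase(text_emb, objs)[1] for objs in images]
    return sorted(range(len(images)), key=lambda i: image_scores[i], reverse=True)


# Toy example with 2-D embeddings (made-up values, for illustration only).
query = [1.0, 0.0]
objects = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
best_index, best_score = ground_phrase(query, objects)
```

In this toy example the second object (index 1) is selected because it points in nearly the same direction as the query. Scoring an image by its best-matching object, rather than by a single global vector, is one simple way a small object can still drive retrieval.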

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning