Ego: Embedding-Guided Personalization of Vision-Language Models

Soroush Seifi; Simon Gardier; Vaggelis Dorovatas; Daniel Olmeda Reino; Rahaf Aljundi

arXiv:2603.09771·cs.CV·March 12, 2026

Open Access

TL;DR

This paper introduces an efficient method for personalizing vision-language models. It uses the model's internal attention to extract the visual tokens that predominantly represent a target concept; these tokens act as a memory that lets the model recall and describe that concept, supporting personalized AI assistants.

Contribution

The proposed approach leverages the model's internal attention to personalize vision-language models without additional training, improving scalability and deployment efficiency.

Findings

01 Strong performance gains over SOTA methods

02 Effective across single, multi-concept, and video personalization

03 Minimal personalization overhead

Abstract

AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it…
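The token-extraction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-token attention scores have already been read out of the model, and it swaps in a plain top-k selection as a stand-in for whatever selection rule the paper uses. The function name `select_concept_tokens` and the parameter `k` are hypothetical.

```python
import numpy as np


def select_concept_tokens(visual_tokens, attention, k):
    """Keep the k visual tokens that receive the most attention.

    visual_tokens: (num_tokens, dim) array of visual token embeddings.
    attention: (num_tokens,) array with one attention score per visual token.
        How these scores are obtained from the model is an assumption here
        (for example, averaged over heads); the abstract does not say.
    k: number of tokens to keep as the concept "memory" (hypothetical knob).

    Returns the kept indices (in original token order) and their embeddings.
    """
    visual_tokens = np.asarray(visual_tokens, dtype=float)
    attention = np.asarray(attention, dtype=float)
    if visual_tokens.shape[0] != attention.shape[0]:
        raise ValueError("need exactly one attention score per visual token")

    k = min(k, attention.shape[0])
    # Rank tokens by descending attention and keep the first k.
    top = np.argsort(-attention, kind="stable")[:k]
    # Restore the original token order so positional structure is preserved.
    top = np.sort(top)
    return top, visual_tokens[top]


# Toy usage: 5 visual tokens with 2-dimensional embeddings.
tokens = np.arange(10).reshape(5, 2)
scores = [0.1, 0.5, 0.05, 0.3, 0.05]
kept_indices, concept_memory = select_concept_tokens(tokens, scores, k=2)
```

In this toy case the two highest-scoring tokens are kept and returned in their original order; in a real system the stored embeddings would be supplied back to the model as context when the concept needs to be recalled.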

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling