Personal Visual Context Learning in Large Multimodal Models
Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman

TL;DR
This paper formalizes Personal Visual Context Learning (Personal VCL) and introduces Personal-VCL-Bench, a benchmark, together with the Agentic Context Bank, a baseline, for enabling large multimodal models to use user-specific visual context when answering personalized queries.
Contribution
The paper formalizes Personal Visual Context Learning, builds a comprehensive benchmark covering persons, objects, and behaviors, and proposes the Agentic Context Bank baseline to improve personalized visual reasoning in LMMs.
Findings
Frontier LMMs show a significant gap in utilizing user-specific visual context, both in leveraging visual evidence and in aggregating multiple visual observations.
The Agentic Context Bank improves performance over standard prompting methods.
The approach offers a practical path for developing personalized multimodal models.
Abstract
As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these…
