Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval
Binxiao Xu, Junyu Feng, Shaolin Lu, Yulin Luo, Shilin Yan, Hao Liang, Ming Lu, and Wentao Zhang

TL;DR
Jarvis is a personalized AI assistant framework that retrieves user-specific visual and textual information from KV-Caches to improve answer accuracy, demonstrating state-of-the-art performance in various tasks.
Contribution
Introduces Jarvis, a novel personalized AI assistant framework using KV-Cache retrieval for both textual and visual user data, enhancing response accuracy.
Findings
Achieves state-of-the-art results in visual question answering.
Improves accuracy in text-only tasks with personalized data.
Demonstrates effectiveness of KV-Cache retrieval for personalization.
Abstract
The rapid development of vision-language models (VLMs) enables open-ended perception and reasoning. Recent works have started to investigate how to adapt general-purpose VLMs into personalized assistants. Even commercial models such as ChatGPT now support model personalization by incorporating user-specific information. However, existing methods either learn a set of concept tokens or train a VLM to utilize user-specific information, and both pipelines struggle to generate accurate answers as personalized assistants. We introduce Jarvis, an innovative framework for a personalized AI assistant through personal KV-Cache retrieval, which stores user-specific information in the KV-Caches of both textual and visual tokens. The textual tokens are created by summarizing user information into metadata, while the visual tokens are produced by extracting distinct image patches from the…
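The general idea of retrieving a precomputed per-user KV cache and attending over it can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the abstract above is truncated and does not specify how Jarvis indexes or selects caches, so the `PersonalKVStore` class, the cosine-similarity lookup, the single-head attention, and all vectors below are assumptions made purely for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class PersonalKVStore:
    """Hypothetical store: concept name -> (index vector, cached keys, cached values)."""

    def __init__(self):
        self.entries = {}

    def add(self, name, index_vec, keys, values):
        self.entries[name] = (index_vec, keys, values)

    def retrieve(self, query_vec):
        # Assumed retrieval rule: pick the concept with the highest cosine similarity.
        def cosine(a, b):
            return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

        name = max(self.entries, key=lambda n: cosine(query_vec, self.entries[n][0]))
        _, keys, values = self.entries[name]
        return name, keys, values


# Offline step: cache K/V for each personal concept once (toy 2-d values).
store = PersonalKVStore()
store.add("my dog", [1.0, 0.0],
          keys=[[1.0, 0.0], [0.9, 0.1]], values=[[1.0, 0.0], [1.0, 0.0]])
store.add("my car", [0.0, 1.0], keys=[[0.0, 1.0]], values=[[0.0, 1.0]])

# Query time: retrieve the matching cache and prepend it to the prompt's own K/V
# instead of re-encoding the user's information.
query = [2.0, 0.1]
name, cached_k, cached_v = store.retrieve(query)
prompt_k, prompt_v = [[0.5, 0.5]], [[0.5, 0.5]]
output = attend(query, cached_k + prompt_k, cached_v + prompt_v)
```

In this sketch the cached keys and values are computed once and reused across queries, which is the property the reviews below credit for the latency reduction. A real VLM would hold per-layer, per-head tensors rather than short Python lists.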
Peer Reviews
Decision · Submitted to ICLR 2026
1. The proposed framework is lightweight for LVLM personalization. It is a training-free method that addresses LVLM personalization without the costly concept-embedding training required by YoLLaVA and MC-LLaVA. 2. The experiments indicate that Jarvis outperforms the personalization baselines.
There are several weaknesses that should be addressed in this paper: 1. Clarity: This paper does not have a clear presentation, especially in the Method section. In Algorithm 1, several terms confuse the reader, including the box $B_m(u,v)$ (Line 287), the OpenCLIP relevance $R_m^{+}$ (Line 282), and why the background map is the set $\{R_{m,b}^{-}\}_{b\in\mathcal{B}}$ (Line 283). 2. The motivation behind the algorithm is not clearly explained: For example, Algorithm 1 contains a
I like how they reduced latency: by precomputing and reusing external KV caches, Jarvis significantly reduces latency and improves throughput, making it suitable for real-time applications. The framework achieves state-of-the-art results in both text-only and visual question answering tasks, particularly excelling in fine-grained, user-specific scenarios.
Personalized VLMs have been extensively studied in recent years; therefore, the overall scope of this paper feels somewhat limited. I encourage the authors to propose something more novel or distinctive. The approach and design are also somewhat complicated, which raises questions about scalability.
- The paper introduces Jarvis, a novel and practical training-free framework for personalization. The core idea of externalizing user-specific concepts into a reusable KV-Cache is elegant. It directly addresses the high latency and context-length limitations of prompt concatenation methods without the overhead of maintaining per-user model parameters. - The paper provides strong evidence of practical benefits, showcasing an order-of-magnitude reduction in latency and a corresponding increase i
- The experiments primarily focus on single-concept personalization within a given session. A key challenge for a real-world AI assistant is handling queries that involve interactions between multiple personalized concepts (e.g., "my dog," "my daughter," "my car"). It is unclear how the proposed KV-Cache retrieval and concatenation would scale in complexity and performance when a query ambiguously references several distinct entities. - The paper presents evidence construction as a one-time, of
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
