Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM
Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu, Ming Lu, Yuanxing Zhang, Wenhui Dong, Wentao Zhang

TL;DR
This paper introduces a training-efficient collaborative framework called Small-Large Collaboration (SLC) that personalizes large vision-language models by leveraging a meta personalized small VLM, combining the strengths of both models for improved reasoning and personalization.
Contribution
The paper proposes the first training-efficient framework for personalizing large VLMs using a meta personalized small VLM, applicable to both open-source and closed-source models.
Findings
SLC effectively personalizes large VLMs across various benchmarks.
The framework reduces training costs by only training a small VLM.
Experimental results show improved reasoning and personalization capabilities.
Abstract
Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information,…
Peer Reviews
Decision · Submitted to ICLR 2026
1. It's reasonable to use a small VLM to reduce the training cost, since not all tasks require the full capacity of a large VLM. 2. The proposed test-time reflection strategy effectively reduces hallucination from the small VLM, which helps the large VLM achieve better performance. 3. The experimental results demonstrate that the SLC framework can achieve competitive performance with significantly reduced training costs compared to traditional fine-tuning methods. 4. The framework can support c
1. The paper could improve in organization. Most of the training details of the meta-personalized small VLM and the construction of the proposed SQA dataset are deferred to the appendix, making it difficult for readers to fully understand the implementation of the SLC framework. The authors should include more implementation details of both components in the main text to improve readability and reproducibility. 2. The proposed meta-personalization strategy is similar to existing works like "Meta
1. Novel Collaboration Paradigm: The idea of leveraging a small VLM for personalized concept detection and a large VLM for reflection and reasoning is innovative and timely. It effectively addresses the trade-off between training cost and model capability. 2. Training Efficiency: The meta-personalized small VLM, trained only once with LoRA adapters, enables zero-shot adaptation to new user concepts without additional tuning. This results in orders-of-magnitude reduction in training FLOPs. 3. C
1. **Inadequate Handling of False Negatives from the Small VLM:** The current framework only applies test-time reflection to concepts where the small VLM outputs `present = true`. However, for concepts marked as `present = false`, the large VLM performs no further verification. This may lead to missed detections (false negatives) and lower recall, especially when the small VLM fails to recognize a concept due to limited generalization or semantic drift. 2. **Limited Validation of Meta-Concept
1. The modular architecture supports "local small models processing personalized information + cloud large models inference", adapting to edge device scenarios and providing a feasible path for the implementation of VLM personalization. 2. The experiments cover multiple types of tasks and compare mainstream methods, and provide scalability experiments for small models and large models.
1. The core idea relies on a combination of existing technologies, with no breakthrough design, i.e., meta-personalization is based on K-Means clustering and LoRA (both are mature technologies). 2. GPT-4o is the only closed-source model tested. The small models evaluated are all at the 3B level, such as Qwen2.5-VL-3B and Phi3-V; no small models below 1B are tested. The number of meta-concepts K is fixed at 10. The performance changes for values such as K=5 and K=15 have not been analyzed, lack of confir
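The routing that one reviewer describes (test-time reflection applied only to concepts the small VLM marks `present = true`) can be illustrated with a toy sketch. Everything here is hypothetical: the `Detection` schema, the `collaborate` function, and the `verify` callback are stand-ins invented for illustration, not the authors' implementation, and no real VLM is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Detection:
    """One small-VLM output for a single user concept (hypothetical schema)."""
    concept: str
    present: bool


def collaborate(
    detections: List[Detection],
    verify: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Route small-VLM detections through a large-VLM reflection step.

    Only detections with present=True are passed to `verify`, a stand-in for
    the large VLM's test-time reflection. Detections with present=False are
    never re-checked, so a miss by the small VLM cannot be recovered here.
    """
    confirmed: List[str] = []
    rejected: List[str] = []
    unchecked: List[str] = []
    for det in detections:
        if not det.present:
            unchecked.append(det.concept)   # skipped by reflection
        elif verify(det.concept):
            confirmed.append(det.concept)   # reflection agrees
        else:
            rejected.append(det.concept)    # hallucination filtered out
    return {"confirmed": confirmed, "rejected": rejected, "unchecked": unchecked}


# Toy data: the set below plays the role of what is really in the image.
ground_truth = {"my_dog", "my_mug"}
detections = [
    Detection("my_dog", True),    # correct detection
    Detection("my_cat", True),    # hallucinated by the small model
    Detection("my_mug", False),   # missed by the small model
]
result = collaborate(detections, verify=lambda c: c in ground_truth)
print(result)
```

In this toy run the hallucinated `my_cat` is filtered out by reflection, while the missed `my_mug` stays in `unchecked`, which is the false-negative gap the reviewer raises.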
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Gaze Tracking and Assistive Technology
