RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, and Sungroh Yoon

TL;DR
This paper introduces RePIC, a reinforcement learning-based post-training method that significantly improves the ability of personalized multi-modal language models to generate faithful image captions, including multi-concept captions, and outperforms supervised fine-tuning baselines.
Contribution
To the best of the authors' knowledge, RePIC is the first RL-based post-training framework for personalizing multi-modal language models. It addresses the limitations of supervised fine-tuning in complex captioning tasks.
Findings
RePIC outperforms SFT baselines in multi-concept image captioning.
Reinforcement learning improves visual recognition and generation fidelity.
The method enhances personalization in real-world scenarios.
Abstract
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual…
