RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, and Sungroh Yoon

arXiv:2506.18369 · cs.CV · October 13, 2025

TL;DR

This paper introduces RePIC, a reinforcement-learning-based post-training method that improves the ability of multi-modal language models to generate faithful personalized image captions, including in multi-concept settings, and outperforms supervised fine-tuning baselines.

Contribution

RePIC is, to the authors' knowledge, the first RL-based post-training framework for personalizing multi-modal language models. It targets the limitations of supervised fine-tuning in complex captioning tasks such as multi-concept image captioning.

Findings

01. RePIC outperforms SFT baselines in multi-concept image captioning.

02. Reinforcement learning improves visual recognition and generation fidelity.

03. The method enhances personalization in real-world scenarios.

Abstract

Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual…

Code & Models

Datasets

Yeongtak/RePIC-training-data
dataset · 7 downloads

Videos

RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models · SlidesLive