BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Shaokai Ye; Vasileios Saveris; Yihao Qian; Jiaming Hu; Elmira Amirloo; Peter Grasch

arXiv:2605.07394 · cs.CV · May 11, 2026

TL;DR

This paper introduces BalCapRL, a balanced reinforcement learning framework for image captioning with multimodal large language models. It jointly optimizes utility-aware correctness, reference coverage, and linguistic quality rather than a single narrow notion of caption quality.

Contribution

It proposes a multi-objective RL approach that uses reward normalization and length masking to improve captions across correctness, coverage, and linguistic quality.
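As a rough illustration of how reward normalization and length masking could fit together, the Python sketch below z-scores each reward component across a group of sampled captions, averages the normalized components, and zeroes the training signal for captions over a length limit. This is not the paper's implementation: the function names, the group-wise z-score, the equal weighting of components, and the `max_len` cutoff are all assumptions made for illustration, since this summary does not specify how either mechanism is defined.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Z-score one reward component across a group of sampled captions.

    Assumption: normalization is done per component within the group,
    so components on different scales contribute comparably.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def balanced_advantages(rewards, lengths, max_len):
    """Combine several reward components into one advantage per caption.

    rewards: dict mapping a component name to a list of per-caption scores.
    lengths: list of caption lengths in tokens, same order as the scores.
    max_len: hypothetical cutoff; captions longer than this get advantage 0.
    """
    names = sorted(rewards)
    normed = {name: normalize(rewards[name]) for name in names}
    n = len(lengths)
    # Equal-weight average of the normalized components (an assumption).
    combined = [sum(normed[name][i] for name in names) / len(names) for i in range(n)]
    # Length mask: over-length captions contribute no training signal.
    mask = [1.0 if length <= max_len else 0.0 for length in lengths]
    return [a * m for a, m in zip(combined, mask)]


if __name__ == "__main__":
    # Toy scores for three sampled captions; the values are made up.
    rewards = {
        "correctness": [0.2, 0.8, 0.5],
        "coverage": [10.0, 30.0, 20.0],
        "linguistic": [3.0, 3.0, 3.0],
    }
    lengths = [50, 400, 120]
    print(balanced_advantages(rewards, lengths, max_len=256))
```

In this toy group, the correctness and coverage scores sit on very different scales but yield identical z-scores, so neither dominates the combined signal. The second caption scores highest on both, yet it exceeds `max_len` and therefore receives zero advantage.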

Findings

01

Consistently improves caption quality metrics across multiple models.

02

Peak gains of +13.6 in DCScore and +29.0 in CapArena.

03

Enhanced balance between correctness, coverage, and linguistic quality.

Abstract

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic…
