RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Tzu-Heng Huang; Sirajul Salekin; Javier Movellan; Frederic Sala; Manjot Bilkhu

arXiv:2603.09160 · cs.CV · March 11, 2026

Open Access

TL;DR

RubiCap introduces a reinforcement learning framework guided by LLM-written rubrics to improve dense image captioning, achieving state-of-the-art results with better diversity and efficiency.

Contribution

The paper presents RubiCap, a novel RL approach that uses LLM-generated rubrics for structured reward signals, enhancing caption quality and diversity in open-ended tasks.

Findings

01. Achieves highest win rates on CapArena benchmarks.
02. Outperforms supervised distillation and prior RL methods.
03. Produces stronger pretrained vision-language models.

Abstract

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in…
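
The abstract breaks off before it explains how rubrics become rewards, so the following is only a minimal sketch of the general idea of rubric-based scoring, not the paper's method. It assumes a rubric is a list of weighted yes/no criteria and that the reward is the weighted fraction of criteria a caption satisfies. The names `Criterion`, `keyword_check`, and `rubric_reward` are hypothetical, and the keyword matcher is a toy stand-in for whatever LLM judge would evaluate each criterion in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item; the description and weight are illustrative assumptions."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM judge's yes/no verdict


def keyword_check(*keywords: str) -> Callable[[str], bool]:
    """Toy judge: passes if every keyword appears in the caption (case-insensitive)."""
    def _check(caption: str) -> bool:
        text = caption.lower()
        return all(k.lower() in text for k in keywords)
    return _check


def rubric_reward(caption: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria the caption satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(caption))
    return earned / total


# Hypothetical sample-specific rubric for a single image.
rubric = [
    Criterion("mentions the dog", 2.0, keyword_check("dog")),
    Criterion("mentions the red ball", 1.0, keyword_check("red", "ball")),
    Criterion("mentions the grass setting", 1.0, keyword_check("grass")),
]

candidates = [
    "A dog runs.",
    "A dog chases a red ball across the grass.",
]
rewards = [rubric_reward(c, rubric) for c in candidates]
print(rewards)
```

Under these assumptions, the more detailed caption satisfies every criterion and earns the higher reward, which is the kind of graded, per-sample signal an RL policy update could consume where no deterministic checker exists.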

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification