RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu

TL;DR
RubiCap introduces a reinforcement learning framework guided by language model-derived rubrics to improve dense image captioning, achieving state-of-the-art results with better diversity and efficiency.
Contribution
The paper presents RubiCap, a novel RL approach that uses LLM-generated rubrics for structured reward signals, enhancing caption quality and diversity in open-ended tasks.
Findings
Achieves highest win rates on CapArena benchmarks.
Outperforms supervised distillation and prior RL methods.
Produces stronger pretrained vision-language models.
Abstract
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in…
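To make the idea of a rubric-derived reward concrete, here is a minimal sketch of how a sample-specific rubric could be turned into a scalar reward for one caption. It is not the paper's implementation: the abstract above is truncated before it describes how rewards are computed, so the `Criterion` type, the per-criterion weights, the weighted pass-fraction scoring, and the `keyword_judge` stand-in (a real setup would presumably query an LLM judge) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float = 1.0


# A judge decides whether a caption satisfies a single criterion.
# Assumption: in practice this would be an LLM call; here it is any callable.
Judge = Callable[[str, Criterion], bool]


def rubric_reward(caption: str, rubric: List[Criterion], judge: Judge) -> float:
    """Return the weighted fraction of rubric criteria the caption satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        return 0.0  # an empty or zero-weight rubric carries no signal
    earned = sum(c.weight for c in rubric if judge(caption, c))
    return earned / total


def keyword_judge(caption: str, criterion: Criterion) -> bool:
    """Toy stand-in judge: passes if the criterion text appears in the caption."""
    return criterion.description.lower() in caption.lower()


if __name__ == "__main__":
    # Hypothetical rubric for a single image; the weights are illustrative.
    rubric = [
        Criterion("dog", weight=2.0),
        Criterion("red ball", weight=1.0),
        Criterion("grass", weight=1.0),
    ]
    print(rubric_reward("A dog chases a red ball.", rubric, keyword_judge))
```

Because the rubric is built per image, the same caption text can score differently against different images, which is what makes the signal sample-specific. The scalar returned here could then be used as the reward in any policy-gradient style update.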
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification
