IC3: Image Captioning by Committee Consensus
David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

TL;DR
IC3 introduces a novel ensemble approach to image captioning that synthesizes multiple viewpoints into a single, more comprehensive and informative caption. Humans rate its captions at least as helpful as those of state-of-the-art models, and they substantially improve automated recall systems.
Contribution
The paper proposes a committee consensus method for image captioning that captures diverse scene details, improving caption informativeness and the performance of automated recall systems.
Findings
Humans rate IC3 captions at least as helpful as those of top baseline models more than two-thirds of the time.
IC3 enhances automated recall system performance by up to 84%.
IC3 outperforms single human-generated reference captions and state-of-the-art captioning approaches.
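The recall finding refers to retrieval systems that must find the right image given a caption. As a rough illustration, the sketch below computes recall@k from a caption-to-image similarity matrix and expresses a gain as a relative improvement. The matrix layout (caption i matches image i), the function names, and the reading of "84%" as a relative gain are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of captions whose correct image ranks in the top k.

    Assumed layout (hypothetical): similarity[i][j] scores caption i
    against image j, and image i is the correct match for caption i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices by descending similarity to caption i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`; 0.84 corresponds to +84%."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Toy scores: caption 1's correct image only ranks second.
    sims = [[0.9, 0.8, 0.1],
            [0.7, 0.6, 0.2],
            [0.1, 0.3, 0.5]]
    print(recall_at_k(sims, 1), recall_at_k(sims, 2))
```

Under this reading, a system whose recall@1 rose from 0.50 to 0.92 would show a 0.84 (84%) relative improvement.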
Abstract
If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Humans rate captions produced by IC3 at least as helpful as baseline SOTA models more than two thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions, and indicating…
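The abstract describes the method only at a high level: several viewpoints (a "committee") are combined into one caption (a "consensus"). The sketch below shows that two-stage shape under stated assumptions. The captioner and summarizer are hypothetical callables supplied by the caller, and the `join_unique` merger in the usage example is a trivial placeholder, not the paper's consensus step.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces, assumed for illustration only.
Captioner = Callable[[str, int], str]         # (image_id, seed) -> one caption
Summarizer = Callable[[Sequence[str]], str]   # several captions -> one caption


def caption_by_committee(image_id: str, captioner: Captioner,
                         summarizer: Summarizer, committee_size: int = 5) -> str:
    # Stage 1: gather several "annotator viewpoints", one per seed.
    committee: List[str] = [captioner(image_id, seed) for seed in range(committee_size)]
    # Stage 2: fuse the viewpoints into a single caption.
    return summarizer(committee)


# --- Toy usage with placeholder components ---
_FAKE_CAPTIONS = ["a dog on grass", "a brown dog running", "a dog on grass"]


def fake_captioner(image_id: str, seed: int) -> str:
    # Stand-in for a sampled captioning model; cycles through fixed captions.
    return _FAKE_CAPTIONS[seed % len(_FAKE_CAPTIONS)]


def join_unique(captions: Sequence[str]) -> str:
    # Placeholder merger: drop exact duplicates, keep first-seen order.
    return "; ".join(dict.fromkeys(captions))


if __name__ == "__main__":
    print(caption_by_committee("img0", fake_captioner, join_unique, committee_size=3))
    # prints "a dog on grass; a brown dog running"
```

Keeping both components injectable separates the committee size and sampling strategy from the choice of summarizer, so each can be varied independently.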
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
