Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time
Uri Berger, Omri Abend, Lea Frermann, Gabriel Stanovsky

TL;DR
This paper introduces an inference-time feedback method for image captioning that mimics human reformulation feedback. It improves caption quality, especially for low-quality captions and non-English captioning, without retraining the underlying captioning model.
Contribution
The paper proposes an inference-time feedback approach in which models trained on human reformulations rewrite the output of existing captioning models. The captioning models themselves are not retrained, and the approach yields significant improvements in non-English captioning and style transfer.
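As a rough illustration of where reformulation sits at inference time, the sketch below chains a frozen captioner with a separate reformulation step. It is not the authors' implementation: the function names, the toy image representation, and the assumption that the reformulator sees both the image and the draft caption are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Dict, List

# Hypothetical toy "image": a dict listing the objects it contains.
ToyImage = Dict[str, List[str]]


def caption_with_reformulation(
    image: ToyImage,
    captioner: Callable[[ToyImage], str],
    reformulator: Callable[[ToyImage, str], str],
) -> str:
    """Run a frozen captioner, then pass its draft through a reformulator."""
    draft = captioner(image)           # base model output, never retrained
    return reformulator(image, draft)  # inference-time feedback step


def toy_captioner(image: ToyImage) -> str:
    # Stand-in for a pretrained captioning model; always emits the same draft.
    return "a dog on a couch"


def toy_reformulator(image: ToyImage, caption: str) -> str:
    # Stand-in for a learned reformulation model: swap an animal word that
    # is absent from the image for one that is actually present.
    animals = {"dog", "cat"}
    present = [obj for obj in image["objects"] if obj in animals]
    fixed = [
        present[0]
        if (word in animals and word not in image["objects"] and present)
        else word
        for word in caption.split()
    ]
    return " ".join(fixed)


image = {"objects": ["cat", "couch"]}
final = caption_with_reformulation(image, toy_captioner, toy_reformulator)
print(final)
```

Only the reformulator would need training in such a setup, which is consistent with the summary's point that the captioning model is left untouched.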
Findings
Improved caption quality with reformulation feedback, especially for low-quality captions.
State-of-the-art results in German image captioning and English style transfer.
Human validation confirms specific axes of improvement.
Abstract
Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback -- caption reformulations -- and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
