Object Counts! Bringing Explicit Detections Back into Image Captioning
Josiah Wang, Pranava Madhyastha, Lucia Specia

TL;DR
This paper shows that explicit object detections can serve as an interpretable image representation for captioning, making it possible to analyse how cues such as object frequency, size, and position contribute to caption generation.
Contribution
The paper reintroduces explicit object detections as an intermediate representation for image captioning and uses them to analyse what end-to-end captioning systems rely on.
Findings
Explicit detections provide rich semantic cues.
Object frequency, size, and position are complementary cues, and each contributes to a good image representation.
Different object categories influence captioning differently.
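As a rough illustration of how frequency, size, and position cues might be derived from detector output, here is a minimal Python sketch. It is not the authors' implementation: the category list, the `encode_detections` helper, the bounding-box tuple format, and the choice to take size and position from the largest instance of each category are all assumptions made for this example.

```python
# Hypothetical category vocabulary; a real detector has its own fixed label set.
CATEGORIES = ["person", "dog", "frisbee"]


def encode_detections(detections, image_w, image_h, categories=CATEGORIES):
    """Turn a list of detections into per-category cue vectors.

    Each detection is assumed to be (label, x_min, y_min, x_max, y_max) in pixels.
    Returns a dict of three lists aligned with `categories`:
      frequency: number of instances per category
      size:      area of the largest instance, as a fraction of the image area
      position:  (x, y) centre of that largest instance, normalised to [0, 1]
    """
    index = {name: i for i, name in enumerate(categories)}
    frequency = [0] * len(categories)
    size = [0.0] * len(categories)
    position = [(0.0, 0.0)] * len(categories)

    for label, x0, y0, x1, y1 in detections:
        if label not in index:
            continue  # ignore labels outside the assumed vocabulary
        i = index[label]
        frequency[i] += 1
        area = (x1 - x0) * (y1 - y0) / (image_w * image_h)
        if area > size[i]:
            # Assumption: the largest instance represents the category.
            size[i] = area
            position[i] = ((x0 + x1) / (2 * image_w), (y0 + y1) / (2 * image_h))

    return {"frequency": frequency, "size": size, "position": position}


# Example: two people and one dog in a 100x100 image.
example = encode_detections(
    [("person", 0, 0, 50, 100), ("person", 50, 0, 100, 50), ("dog", 0, 0, 10, 10)],
    image_w=100,
    image_h=100,
)
```

The three lists could be concatenated into a single vector and fed to a language model in place of a CNN embedding. Whether that matches the representation used in the paper is not stated in this summary.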
Abstract
The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
