TL;DR
This paper investigates the effectiveness of scene graphs in image captioning, finding that current noisy scene graph models do not significantly improve caption quality, but high-quality scene graphs can offer notable gains.
Contribution
The study introduces a conditional graph attention network for scene graph integration and provides a comprehensive empirical analysis of scene graph utility in captioning.
Findings
No significant improvement from predicted scene graphs over object detection features alone
High-quality scene graphs can improve captioning metrics
Scene graph noise impacts caption quality
Abstract
Many top-performing image captioning models rely solely on object features computed with an object detection model to generate image descriptions. However, recent studies propose to directly use scene graphs to introduce information about object relations into captioning, hoping to better describe interactions between objects. In this work, we thoroughly investigate the use of scene graphs in image captioning. We empirically study whether using additional scene graph encoders can lead to better image descriptions and propose a conditional graph attention network (C-GAT), where the image captioning decoder state is used to condition the graph updates. Finally, we determine to what extent noise in the predicted scene graphs influences caption quality. Overall, we find no significant difference between models that use scene graph features and models that only use object detection features…
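To make the idea of conditioning graph updates on the decoder state more concrete, here is a minimal sketch in plain Python. It is not the paper's C-GAT formulation, which the abstract does not spell out. It assumes a simple dot-product attention in which the decoder state is projected and added to each node's query, so the same graph is aggregated differently at each decoding step. The function name `conditional_graph_attention`, the weight matrices `Wq`, `Uq`, `Wk`, `Wv`, and the toy graph are all hypothetical.

```python
import math

def matvec(vec, mat):
    """Row vector (d_in) times matrix (d_in x d_out)."""
    return [sum(v * row[k] for v, row in zip(vec, mat)) for k in range(len(mat[0]))]

def conditional_graph_attention(nodes, adj, dec_state, Wq, Uq, Wk, Wv):
    """One attention update over a scene graph, conditioned on the decoder state.

    Assumed form (illustrative, not taken from the paper):
      query_i = nodes[i] @ Wq + dec_state @ Uq
      score_ij = query_i . (nodes[j] @ Wk) / sqrt(d), for j == i or adj[i][j] truthy
      new_i = sum_j softmax_j(score_ij) * (nodes[j] @ Wv)
    """
    n = len(nodes)
    ctx = matvec(dec_state, Uq)  # projected decoder state, shared by all nodes
    queries = [[q + c for q, c in zip(matvec(x, Wq), ctx)] for x in nodes]
    keys = [matvec(x, Wk) for x in nodes]
    values = [matvec(x, Wv) for x in nodes]
    scale = math.sqrt(len(keys[0]))

    updated, all_weights = [], []
    for i in range(n):
        # Attend only to graph neighbours, plus the node itself (self-loop).
        neighbours = [j for j in range(n) if j == i or adj[i][j]]
        scores = [sum(q * k for q, k in zip(queries[i], keys[j])) / scale
                  for j in neighbours]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [0.0] * n
        for j, e in zip(neighbours, exps):
            weights[j] = e / total
        updated.append([sum(weights[j] * values[j][k] for j in range(n))
                        for k in range(len(values[0]))])
        all_weights.append(weights)
    return updated, all_weights

# Toy graph: node 0 is connected to nodes 1 and 2; nodes 1 and 2 are not connected.
identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

# Two different decoder states give two different attention patterns.
out_a, w_a = conditional_graph_attention(nodes, adj, [1.0, 0.0],
                                         identity, identity, identity, identity)
out_b, w_b = conditional_graph_attention(nodes, adj, [0.0, 1.0],
                                         identity, identity, identity, identity)
```

Because the decoder state shifts every query, the attention weights over the same neighbourhood change from one decoding step to the next. That is the sense in which the graph update is "conditioned". A real model would learn the projection matrices and typically use multiple heads and stacked layers.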
