Auto-Encoding Scene Graphs for Image Captioning
Xu Yang, Kaihua Tang, Hanwang Zhang, Jianfei Cai

TL;DR
This paper introduces Scene Graph Auto-Encoder (SGAE), a novel approach that incorporates language inductive bias via scene graphs and shared dictionaries to improve image captioning, achieving state-of-the-art results on MS-COCO.
Contribution
The paper presents a new SGAE framework that transfers language priors across vision and language domains using scene graphs and shared dictionaries, enhancing captioning performance.
Findings
Achieves 127.8 CIDEr-D on the MS-COCO Karpathy split, surpassing previous models.
The shared dictionary effectively transfers language bias across domains.
A single SGAE-based model is competitive with ensemble models on the MS-COCO benchmark.
Abstract
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inference in discourse. For example, when we see the relation 'person on bike', it is natural to replace 'on' with 'ride' and infer 'person riding bike on a road' even if the 'road' is not evident. Therefore, exploiting such bias as a language prior is expected to make conventional encoder-decoder models less likely to overfit to the dataset bias and more likely to focus on reasoning. Specifically, we use the scene graph, a directed graph (G) where an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both image (I) and sentence (S). In the textual…
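The scene graph described in the abstract can be illustrated with a small data structure. The sketch below is only an illustration, not the authors' implementation: the class name `SceneGraph`, its methods, the edge directions (object to adjective, subject to relationship to object), and the adjective 'red' are all assumptions introduced here. It encodes the abstract's example 'person riding bike on a road'.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Directed graph with object, adjective, and relationship nodes (hypothetical sketch)."""
    nodes: dict = field(default_factory=dict)  # node id -> (kind, label)
    edges: list = field(default_factory=list)  # (source id, target id)

    def add_node(self, kind, label):
        node_id = len(self.nodes)
        self.nodes[node_id] = (kind, label)
        return node_id

    def add_adjective(self, obj_id, label):
        # Assumed convention: a directed edge from the object to its adjective node.
        adj_id = self.add_node("adjective", label)
        self.edges.append((obj_id, adj_id))
        return adj_id

    def add_relationship(self, subj_id, label, obj_id):
        # Assumed convention: subject -> relationship -> object.
        rel_id = self.add_node("relationship", label)
        self.edges.append((subj_id, rel_id))
        self.edges.append((rel_id, obj_id))
        return rel_id

    def triples(self):
        """Return (subject, relationship, object) label triples."""
        result = []
        for rel_id, (kind, label) in self.nodes.items():
            if kind != "relationship":
                continue
            subjects = [s for s, t in self.edges if t == rel_id]
            objects = [t for s, t in self.edges if s == rel_id]
            for s in subjects:
                for o in objects:
                    result.append((self.nodes[s][1], label, self.nodes[o][1]))
        return result


graph = SceneGraph()
person = graph.add_node("object", "person")
bike = graph.add_node("object", "bike")
road = graph.add_node("object", "road")
graph.add_adjective(bike, "red")  # hypothetical adjective, not from the paper
graph.add_relationship(person, "ride", bike)
graph.add_relationship(bike, "on", road)
print(graph.triples())
```

Making relationships nodes rather than edge labels keeps all three node kinds in one graph, which matches the abstract's description of an object node being connected by adjective nodes and relationship nodes.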
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
