Visually Grounded Concept Composition
Bowen Zhang, Hexiang Hu, Linlu Qiu, Peter Shaw, Fei Sha

TL;DR
This paper introduces a framework for composing complex textual concepts from primitive ones while grounding them in images. It pairs a Concept and Relation Graph (CRG) with a neural network called Composer to improve text-to-image matching accuracy, especially for unseen compound concepts.
Contribution
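The paper proposes the Concept and Relation Graph (CRG), which builds on constituency analysis, and Composer, a neural network that uses the CRG to learn visually grounded concepts. Learning to compose yields more robust grounding, as measured by text-to-image matching accuracy.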
Findings
Improved text-to-image matching accuracy with compositional concepts.
Effective grounding of both primitive and complex concepts.
Enhanced performance on data with high compound divergence.
Abstract
We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose the Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model captures grounded concepts forming at both the finer-grained sentence level and the coarser-grained intermediate level (or word-level). Composer leads to pronounced improvement in matching accuracy when the evaluation data has significant…
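The recursive composition described in the abstract can be illustrated with a toy sketch. This is not the authors' Composer architecture: the `Node` class, the element-wise gating used for predicates, the hand-written 3-dimensional vectors, and the image names are all assumptions made for illustration. It only shows the general idea of building a concept embedding bottom-up over a tree and matching it to images by cosine similarity.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence

Vector = List[float]


@dataclass
class Node:
    """Toy graph node: a primitive concept (leaf) or a predicate over children."""
    label: str                       # primitive word, or predicate name for inner nodes
    children: Sequence["Node"] = ()  # empty for primitive concepts


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose(node: Node, primitives: Dict[str, Vector],
            predicates: Dict[str, Vector]) -> Vector:
    """Recursively embed a node: leaves look up a primitive embedding,
    inner nodes average their children and apply a predicate-specific gate.
    The gating rule is a hypothetical stand-in for a learned composition."""
    if not node.children:
        return normalize(primitives[node.label])
    child_vecs = [compose(c, primitives, predicates) for c in node.children]
    dim = len(child_vecs[0])
    mean = [sum(v[i] for v in child_vecs) / len(child_vecs) for i in range(dim)]
    gate = predicates[node.label]
    return normalize([g * m for g, m in zip(gate, mean)])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def best_image(text_vec: Vector, image_vecs: Dict[str, Vector]) -> str:
    """Return the name of the image whose embedding best matches the text."""
    return max(image_vecs, key=lambda name: cosine(text_vec, image_vecs[name]))


# Hypothetical embeddings; in a real system these would be learned.
primitives = {"dog": [1.0, 0.0, 0.0], "ball": [0.0, 1.0, 0.0]}
predicates = {"with": [1.0, 1.0, 1.0]}
images = {"img_dog_ball": [1.0, 1.0, 0.0], "img_cat": [0.0, 0.0, 1.0]}

# "dog with ball" as a small tree: predicate "with" over two primitives.
tree = Node("with", (Node("dog"), Node("ball")))
sentence_vec = compose(tree, primitives, predicates)
match = best_image(sentence_vec, images)
```

Because every intermediate node produces its own embedding, the same `compose` call can be applied to a subtree to obtain a vector for an intermediate concept, which mirrors the abstract's point that both primitive and composed concepts are aligned to images.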
