KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph
Yanbei Jiang, Krista A. Ehinger, Jey Han Lau

TL;DR
KALE is a novel artwork image captioning system that enhances caption quality by integrating artwork metadata through a heterogeneous knowledge graph and a cross-modal alignment loss, outperforming existing models.
Contribution
This work introduces KALE, a new model that combines metadata and knowledge graphs with vision-language techniques for improved artwork captioning.
Findings
KALE achieves superior CIDEr scores compared to state-of-the-art methods.
Incorporating metadata via a knowledge graph enhances caption accuracy.
The cross-modal alignment loss improves the correlation between images and metadata.
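As a rough illustration of what a cross-modal alignment objective can look like, the sketch below implements a generic symmetric contrastive (InfoNCE-style) loss between image embeddings and metadata embeddings. This is a stand-in technique chosen for illustration, not KALE's actual loss, whose exact form is not given here. The function names, the use of cosine similarity, and the `temperature` value are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _diagonal_cross_entropy(sim, temperature):
    """Mean of -log softmax(sim[i] / temperature)[i] over all rows i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def alignment_loss(image_embs, meta_embs, temperature=0.1):
    """Hypothetical symmetric contrastive loss (illustrative only).

    image_embs[i] and meta_embs[i] are assumed to describe the same artwork;
    every other pairing in the batch is treated as a negative.
    """
    sim = [[cosine(img, meta) for meta in meta_embs] for img in image_embs]
    sim_t = [list(col) for col in zip(*sim)]  # metadata-to-image direction
    return 0.5 * (
        _diagonal_cross_entropy(sim, temperature)
        + _diagonal_cross_entropy(sim_t, temperature)
    )


# Toy batch: two artworks with 2-d embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_metadata = [[1.0, 0.0], [0.0, 1.0]]
swapped_metadata = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = alignment_loss(images, matched_metadata)
loss_swapped = alignment_loss(images, swapped_metadata)
```

In this toy batch the loss is close to zero when each image is paired with its own metadata and much larger when the pairs are swapped, which is the behavior an alignment objective is meant to reward.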
Abstract
Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork's meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
