TL;DR
This paper introduces a transformer-based model trained on a large-scale art dataset with Iconclass annotations to generate meaningful captions for artworks, addressing unique challenges in art image captioning.
Contribution
It presents a novel dataset and fine-tunes a vision-language model specifically for art images, improving caption relevance in art historical context.
Findings
Generated captions are more relevant to the art context than those produced by models trained on natural images
The model generalizes well to new artwork collections
Captions exhibit strong relevance to artistic genres
Abstract
Image captioning is the task of automatically generating textual descriptions of images based only on the visual input. Although this has been an extensively addressed research topic in recent years, few contributions have been made in the domain of art historical data. In this particular context, the task of image captioning is confronted with various challenges, such as the lack of large-scale datasets of image-text pairs, the complexity of meaning associated with describing artworks, and the need for expert-level annotations. This work aims to address some of those challenges by utilizing a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography. The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning…
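The abstract mentions turning concept annotations into clean textual descriptions but does not say how. As a rough illustration only, and not the authors' actual pipeline, the sketch below assumes each artwork comes with a list of free-text concept labels and joins them into a single caption string. The function name `labels_to_caption`, the cleaning rules (dropping parenthesised qualifiers, lowercasing, removing duplicates) and the example labels are all hypothetical.

```python
import re


def labels_to_caption(labels):
    """Join free-text concept labels into one caption string.

    Hypothetical cleaning rules, not taken from the paper:
    drop parenthesised qualifiers, collapse whitespace,
    lowercase, and remove empty or duplicate labels.
    """
    cleaned = []
    seen = set()
    for label in labels:
        # Remove parenthesised qualifiers such as "(symbolic)".
        text = re.sub(r"\([^)]*\)", " ", label)
        # Collapse runs of whitespace and normalise case.
        text = re.sub(r"\s+", " ", text).strip().lower()
        # Keep each non-empty label once, in first-seen order.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return ", ".join(cleaned)


# Made-up labels for illustration; not real dataset entries.
example_labels = ["Landscape with river", "landscape  with river", "Horse (symbolic)", ""]
caption = labels_to_caption(example_labels)
```

Pairing each image with a string produced this way would give image-text pairs of the kind a captioning model needs for fine-tuning, although the real preprocessing may differ substantially.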
