Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage
Dario Cioni, Lorenzo Berlincioni, Federico Becattini, Alberto del Bimbo

TL;DR
This paper proposes a diffusion-based data augmentation method that generates diverse artwork variations conditioned on captions, improving model training for cultural heritage applications despite limited data and domain shifts.
Contribution
It introduces a novel generative vision-language augmentation approach specifically designed for cultural heritage datasets, addressing data scarcity and domain shift issues.
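The augmentation idea summarized above can be sketched as a simple loop: for each image-caption pair, generate several new images conditioned on the caption and add them to the training set under the same caption. The snippet below is only a minimal illustration under stated assumptions, not the authors' pipeline. The `Sample` type, the `augment` function, and the `generate` callable signature are all hypothetical names introduced here, and `stub_generate` is a stand-in for a real caption-conditioned diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence
import random


@dataclass(frozen=True)
class Sample:
    """One training example (hypothetical container, not from the paper)."""
    image: object          # placeholder for an image object or array
    caption: str
    synthetic: bool = False


# Assumed generator signature: (source image, caption, seed) -> new image.
# In practice this would wrap a caption-conditioned diffusion model.
Generator = Callable[[object, str, int], object]


def augment(
    dataset: Sequence[Sample],
    generate: Generator,
    n_variations: int = 2,
    seed: int = 0,
) -> List[Sample]:
    """Return the original samples followed by generated variations.

    Each variation reuses the caption of the sample it was generated from,
    so the augmented set stays usable for captioning and retrieval training.
    """
    rng = random.Random(seed)
    augmented = list(dataset)  # keep every original sample
    for sample in dataset:
        for _ in range(n_variations):
            # A fresh seed per variation encourages diverse outputs.
            variation_seed = rng.randrange(2**31)
            new_image = generate(sample.image, sample.caption, variation_seed)
            augmented.append(Sample(new_image, sample.caption, synthetic=True))
    return augmented


def stub_generate(image: object, caption: str, seed: int) -> object:
    """Stand-in generator so the sketch runs without any model weights."""
    return f"{image}|{caption}|{seed}"


if __name__ == "__main__":
    data = [Sample("img0", "a portrait"), Sample("img1", "a landscape")]
    for s in augment(data, stub_generate, n_variations=2):
        print(s)
```

Keeping the generator as an injected callable is a design choice for this sketch: the loop can be exercised with the stub, and a real diffusion model can be swapped in later without changing the dataset logic.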
Findings
Enhanced dataset diversity improves model performance.
Generated variations lead to better captioning accuracy.
Bridging domain gaps enhances visual and linguistic understanding.
Abstract
Cultural heritage applications and advanced machine learning models are creating a fruitful synergy to provide effective and accessible ways of interacting with artworks. Smart audio-guides, personalized art-related content and gamification approaches are just a few examples of how technology can be exploited to provide additional value to artists or exhibitions. Nonetheless, from a machine learning point of view, the amount of available artistic data is often not enough to train effective models. Off-the-shelf computer vision modules can still be exploited to some extent, yet a severe domain shift is present between art images and standard natural image datasets used to train such models. As a result, this can lead to degraded performance. This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain. By…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
