TL;DR
This paper introduces a new large-scale dataset for aesthetic image captioning created through an automatic cleaning process, and proposes a weakly supervised method to learn aesthetic features without needing detailed annotations.
Contribution
It presents a novel dataset, AVA-Captions, and a weakly supervised training strategy for aesthetic feature extraction in image captioning.
Findings
The dataset contains 230,000 images with 5 captions each.
The weakly supervised strategy trains the CNN-based visual feature extractor by exploiting latent associations between aesthetic attributes, without detailed annotations.
Automatic metrics and subjective evaluations validate the approach.
Abstract
Aesthetic image captioning (AIC) refers to the multi-modal task of generating critical textual feedback for photographs. While in natural image captioning (NIC), deep models are trained in an end-to-end manner using large curated datasets such as MS-COCO, no such large-scale, clean dataset exists for AIC. Towards this goal, we propose an automatic cleaning strategy to create a benchmarking AIC dataset, by exploiting the images and noisy comments easily available from photography websites. We propose a probabilistic caption-filtering method for cleaning the noisy web-data, and compile a large-scale, clean dataset "AVA-Captions" (230,000 images with 5 captions per image). Additionally, by exploiting the latent associations between aesthetic attributes, we propose a strategy for training the convolutional neural network (CNN) based visual feature extractor, the first component of the…
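The abstract does not spell out how the probabilistic caption filter scores a comment. As a rough illustration only, the sketch below assumes a simple unigram model: each comment is scored by the mean negative log-probability of its words under the corpus, so generic comments built from very frequent words score low and are dropped. The function names, the `threshold` value, and the unigram scoring itself are assumptions made for this example, not the paper's formulation.

```python
import math
import re
from collections import Counter


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", comment.lower())


def unigram_probs(comments):
    """Estimate word probabilities from relative frequency in the corpus."""
    counts = Counter(tok for c in comments for tok in tokenize(c))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def informativeness(comment, probs, floor=1e-9):
    """Mean negative log-probability of the comment's words.

    Hypothetical score: rarer words raise it, very common words lower it.
    Words unseen in the corpus fall back to a small floor probability.
    """
    tokens = tokenize(comment)
    if not tokens:
        return 0.0
    return -sum(math.log(probs.get(t, floor)) for t in tokens) / len(tokens)


def filter_comments(comments, threshold):
    """Keep only comments whose score exceeds the (assumed) threshold."""
    probs = unigram_probs(comments)
    return [c for c in comments if informativeness(c, probs) > threshold]


if __name__ == "__main__":
    noisy = [
        "nice shot",
        "nice shot",
        "nice",
        "the shallow depth of field isolates the subject nicely",
    ]
    for kept in filter_comments(noisy, threshold=2.0):
        print(kept)
```

On this toy corpus the repeated "nice shot" comments fall below the threshold while the longer, more specific critique is kept. In practice the threshold would have to be tuned on the real comment corpus.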
