Aesthetic Visual Question Answering of Photographs
Xin Jin, Wu Zhou, Xinghui Zhou, Shuai Cui, Le Zhang, Jianwen Lv, Shu Zhao

TL;DR
This paper introduces aesthetic visual question answering (AVQA) for photographs as a new task, builds the first dataset for it, and proposes methods that incorporate subjectivity to improve model accuracy in aesthetic assessment.
Contribution
It presents the first AVQA dataset, AesVQA, and proposes methods that enhance model performance by adjusting the data distribution and integrating subjective aesthetic judgments.
Findings
Proposed methods improve AVQA model accuracy.
Created the first large-scale aesthetic VQA dataset.
Outperforms existing VQA models on aesthetic tasks.
Abstract
Aesthetic assessment of images can be categorized into two main forms: numerical assessment and language assessment. Aesthetic captioning of photographs is the only aesthetic language assessment task that has been addressed so far. In this paper, we propose a new task of aesthetic language assessment: aesthetic visual question answering (AVQA) of images. Given a question about an image's aesthetics, the model predicts the answer. We use images from www.flickr.com. The objective QA pairs are generated by the proposed aesthetic attribute analysis algorithms. Moreover, we introduce subjective QA pairs that are converted from aesthetic numerical labels and from sentiment analysis by large-scale pre-trained models. We build the first aesthetic visual question answering dataset, AesVQA, which contains 72,168 high-quality images and 324,756 aesthetic question-answer pairs. Two methods for…
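The abstract mentions converting aesthetic numerical labels into subjective QA pairs but does not say how. The sketch below is only an illustration of one way such a conversion could look, not the authors' procedure: the function name, the question wording, the 1 to 10 score scale, and both thresholds are assumptions made for this example.

```python
from typing import Tuple

# Hypothetical cut-offs on an assumed 1-10 mean aesthetic score.
# These values are illustrative and are not taken from the paper.
LOW_THRESHOLD = 4.0
HIGH_THRESHOLD = 6.0


def score_to_subjective_qa(mean_score: float) -> Tuple[str, str]:
    """Map a numerical aesthetic label to a (question, answer) pair."""
    if not 1.0 <= mean_score <= 10.0:
        raise ValueError("mean_score must lie in [1, 10]")

    # Hypothetical question template; the dataset's real wording may differ.
    question = "How is the overall aesthetic quality of this photo?"

    if mean_score >= HIGH_THRESHOLD:
        answer = "good"
    elif mean_score <= LOW_THRESHOLD:
        answer = "bad"
    else:
        answer = "average"
    return question, answer


if __name__ == "__main__":
    for score in (2.5, 5.0, 7.2):
        print(score, score_to_subjective_qa(score))
```

Binning a continuous score into a few answer classes turns the subjective label into a classification target, which is the form most VQA models expect. Where the bin edges sit directly shapes the answer distribution.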
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
