Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, Zheng Zhang, Qi Dai, Chong Luo, Xin Geng, Baining Guo

TL;DR
This paper introduces a novel approach to align vision models with human aesthetic standards in retrieval systems by leveraging LLM reasoning, reinforcement learning, and new benchmarks, significantly improving aesthetic alignment.
Contribution
It proposes a preference-based reinforcement learning method that distills knowledge from LLMs and aesthetic models to better align vision models with human aesthetics, along with new benchmarks for evaluation.
Findings
Enhanced aesthetic alignment in vision models demonstrated by improved metrics.
Utilization of LLM reasoning extends aesthetic expectations beyond low-level features.
Introduction of the HPIR dataset for robust evaluation of aesthetic alignment.
Abstract
Modern vision models are trained on very large noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent to output the desired results in certain aspects, e.g., visual aesthetic, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural or knowledge contexts are involved. We find that utilizing the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming. Based on the above findings, we propose a preference-based reinforcement learning…
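The cascade setup the abstract describes as the usual baseline (retrieve by query relevance, then let an aesthetic model re-rank the survivors) can be sketched roughly as below. This is a minimal illustration and not the paper's method: the function name `cascade_rerank`, the `weight` blending parameter, and the toy scores are all assumptions introduced here, and real systems would obtain the scores from a retrieval model and an aesthetic predictor.

```python
from typing import List, Sequence


def cascade_rerank(
    similarity: Sequence[float],
    aesthetic: Sequence[float],
    k: int,
    weight: float = 0.5,
) -> List[int]:
    """Return candidate indices after a two-stage cascade (hypothetical sketch).

    similarity[i]: query-to-image relevance score of candidate i.
    aesthetic[i]:  score of candidate i from some aesthetic model.
    weight:        assumed blending factor between the two scores.
    """
    if len(similarity) != len(aesthetic):
        raise ValueError("score lists must have the same length")

    # Stage 1: keep only the k candidates most relevant to the query.
    top_k = sorted(
        range(len(similarity)), key=lambda i: similarity[i], reverse=True
    )[:k]

    # Stage 2: re-order the survivors by a blend of relevance and aesthetics.
    def blended(i: int) -> float:
        return (1.0 - weight) * similarity[i] + weight * aesthetic[i]

    return sorted(top_k, key=blended, reverse=True)


if __name__ == "__main__":
    sim = [0.9, 0.8, 0.7, 0.1]   # toy relevance scores
    aes = [0.2, 0.9, 0.5, 1.0]   # toy aesthetic scores
    print(cascade_rerank(sim, aes, k=3))
```

Note that candidate 3 never reaches the second stage despite its high aesthetic score, because the first stage filters on relevance alone. The re-ranker can only reorder what retrieval already returned, and its judgment is only as good as the aesthetic model's features, which is the limitation the abstract points to.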
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Visual Attention and Saliency Detection
Methods: ALIGN
