Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

Miaosen Zhang; Yixuan Wei; Zhen Xing; Yifei Ma; Zuxuan Wu; Ji Li; Zheng Zhang; Qi Dai; Chong Luo; Xin Geng; Baining Guo

arXiv:2406.09397·cs.CV·June 14, 2024

TL;DR

This paper introduces a novel approach to align vision models with human aesthetic standards in retrieval systems by leveraging LLM reasoning, reinforcement learning, and new benchmarks, significantly improving aesthetic alignment.

Contribution

It proposes a preference-based reinforcement learning method that distills knowledge from LLMs and aesthetic models to better align vision models with human aesthetics, along with new benchmarks for evaluation.

Findings

01. Enhanced aesthetic alignment in vision models demonstrated by improved metrics.

02. Utilization of LLM reasoning extends aesthetic expectations beyond low-level features.

03. Introduction of the HPIR dataset for robust evaluation of aesthetic alignment.

Abstract

Modern vision models are trained on very large noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent to output the desired results in certain aspects, e.g., visual aesthetic, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural or knowledge contexts are involved. We find that utilizing the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming. Based on the above findings, we propose a preference-based reinforcement learning…
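The cascade that the abstract describes as the common baseline (retrieve by relevance, then filter or re-rank with an aesthetic model) can be sketched roughly as follows. This is a minimal illustration and not the paper's method: the function name `cascade_rerank`, its parameters, and the toy scores are all hypothetical, and the aesthetic scorer is a stand-in lookup rather than a real model.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_rerank(
    candidates: Sequence[Tuple[str, float]],
    aesthetic_score: Callable[[str], float],
    top_k: int = 3,
    min_score: float = 0.0,
) -> List[str]:
    """Two-stage cascade: relevance shortlist, then aesthetic filter and re-rank.

    `candidates` holds (image_id, relevance) pairs from a first-stage retriever.
    `aesthetic_score` is a hypothetical stand-in for an aesthetic model.
    """
    # Stage 1: keep the top_k candidates by retrieval relevance.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    # Stage 2: score the shortlist, drop items below the threshold,
    # and order the survivors by aesthetic score (highest first).
    scored = [(image_id, aesthetic_score(image_id)) for image_id, _ in shortlist]
    kept = [(image_id, s) for image_id, s in scored if s >= min_score]
    return [image_id for image_id, _ in sorted(kept, key=lambda p: p[1], reverse=True)]


# Toy data (assumed values, for illustration only).
toy_scores = {"a.jpg": 0.2, "b.jpg": 0.9, "c.jpg": 0.6, "d.jpg": 1.0}
candidates = [("a.jpg", 0.95), ("b.jpg", 0.90), ("c.jpg", 0.85), ("d.jpg", 0.10)]

result = cascade_rerank(candidates, toy_scores.__getitem__, top_k=3, min_score=0.5)
print(result)
```

In this toy run, `d.jpg` has the highest aesthetic score but never reaches the second stage because its relevance is too low, and the scorer only ever sees a single number per image. That is one way to picture the limitation the abstract points to: a re-ranker of this kind has no access to the stylistic, cultural, or knowledge context of the query.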

Videos

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Visual Attention and Saliency Detection

Methods: ALIGN