UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Yutao Li, Xiu Li, Runze Hu, Guangtao Zhai

TL;DR
UniQA is a unified vision-language pre-training framework that enhances image quality and aesthetic assessment by leveraging multimodal data, generating high-quality text descriptions, and employing adapters for improved downstream task performance.
Contribution
The paper introduces UniQA, a novel unified pre-training approach that combines IQA and IAA tasks using multimodal large language models and data purification techniques.
Findings
Achieves high performance on classical IQA and IAA tasks
Effective in few-label IQA scenarios
Demonstrates versatility across multiple image assessment tasks
Abstract
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Despite distinct learning objectives, they have underlying interconnectedness due to consistent human assessment perception. In this paper, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to extract useful and common representations from two tasks, thereby benefiting them simultaneously. However, the lack of text in the IQA datasets and the textual noise in the IAA datasets pose severe challenges for multimodal pre-training. To address this, we (1) utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) use the generated text for IAA as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This paper includes comprehensive datasets, encompassing nearly all existing IQA and IAA datasets, providing robust validation for the effectiveness of the proposed method. 2. The methodology of the algorithm is described in a clear and straightforward manner, with concise and easily understandable language. 3. The paper presents an extensive set of experiments and rich visualizations, thoroughly validating each module within the algorithm's design.
1. The paper’s motivation appears to lack practical significance or has not been convincingly demonstrated. 2. While the proposed approach is intricate, its actual innovation is minimal. 3. The related work section includes methods that are either outdated or lack representativeness in the current IQA and IAA research. 4. Beyond metric improvements, the proposed method lacks substantial inspirational value for future studies. 5. The paper is missing some essential experiments that could substant
1. In this paper, a high-quality image-text dataset about image quality and aesthetics is constructed with the assistance of MLLMs, which is valuable. 2. The organizational structure of the article is clear and the content is complete. The writing is clear and easy to follow. 3. The motivation to "extract mutually beneficial and effective representations for both IQA and IAA tasks" in this paper is plausible. 4. This paper proposes an effective data purification strategy that refines the raw
1. The authors highlight that the motivation of this paper is to "extract mutually beneficial and effective representations for both IQA and IAA tasks." However, throughout the paper, neither the proposed dataset nor the proposed method fully explores the mutually beneficial representations for IQA and IAA tasks; instead, they only address the creation of effective representations for these tasks. Specifically, the method proposed in this paper learns a shared feature representation for IQA and IA
1) This paper is well-written. 2) This paper achieves SOTA performance on IQA and IAA tasks.
The overall framework and concept are simple and lack novelty. 1) Directly utilizing MLLMs to generate text is now common practice. 2) The pre-training design is straightforward and typical of MLLMs. 3) This paper reports only the performance on IQA and IAA, offering a limited range of downstream tasks.
The refine module proposed in this paper can transform messy data into a quality-related format, thereby ensuring that the model is highly consistent with the supervised experience of the human eye, and has the potential to be applied in future IQA and IAA tasks. This can achieve the migration of large-scale, general descriptive datasets to quality-related specialized datasets, which can promote progress in the field of IQA. The method proposed in this paper can be used for both traditional vis
The core implementation of UniQA is to predict the probability of each level in the prompt "The image quality is {bad, poor, fair, good, perfect}" and then fuse them. This is not a completely new paradigm. As far as I know, this first appeared in Q-Align. However, the authors only reviewed that work without comparing against it in experiments. Considering the similarities between the two, it is necessary for the authors to conduct comparative experiments in Table 1 and emphasize the differences between the two. It is a good p
1. UniQA effectively combines IQA and IAA tasks, extracting shared representations, leading to efficient and comprehensive visual assessment capabilities. 2. Using MLLMs to generate high-quality text descriptions enriches the dataset. 3. The lightweight Multi-Cue Integration Adapter allows UniQA to adapt efficiently to various downstream IQA and IAA tasks with minimal parameter adjustments.
1. Although this paper effectively utilizes MLLM-generated text for dataset construction, the generated descriptions tend to have similar structures and expressions, resulting in limited text diversity. The model's performance heavily depends on the quality of MLLM-generated text, which may introduce noise or bias, especially as MLLMs may produce overly positive or vague evaluations when generating image descriptions. 2. While this method aligns IQA and IAA datasets by unifying them on a common
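One of the reviews above summarizes UniQA's scoring step as predicting probabilities for the prompt "The image quality is {bad, poor, fair, good, perfect}" and then fusing them. The following is a minimal sketch of that kind of fusion, not the authors' implementation. It assumes the model yields one image-text similarity value per level, applies a softmax, and takes the probability-weighted average of level weights 1 to 5. The weights, the temperature parameter, and the function name fuse_level_scores are illustrative assumptions that do not come from the paper.

```python
import math

# Quality levels named in the review's description of the prompt.
LEVELS = ["bad", "poor", "fair", "good", "perfect"]


def fuse_level_scores(similarities, weights=(1.0, 2.0, 3.0, 4.0, 5.0), temperature=1.0):
    """Fuse per-level similarities into level probabilities and one scalar score.

    similarities: one image-text similarity per level, ordered as LEVELS
                  (hypothetical input; how it is computed is not shown here).
    weights:      numeric value assigned to each level (assumed to be 1..5).
    temperature:  softmax temperature (assumed, must be positive).
    """
    if len(similarities) != len(weights):
        raise ValueError("need exactly one similarity per level weight")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Numerically stable softmax over the per-level similarities.
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Expected level value under the softmax distribution.
    score = sum(p * w for p, w in zip(probs, weights))
    return probs, score


if __name__ == "__main__":
    # Made-up similarities, for illustration only.
    probs, score = fuse_level_scores([0.1, 0.2, 0.4, 0.9, 0.6])
    for level, p in zip(LEVELS, probs):
        print(f"{level:>8}: {p:.3f}")
    print(f"fused score: {score:.3f}")
```

With equal similarities for all five levels, the probabilities are uniform and the fused score sits at the midpoint of the assumed 1 to 5 scale. A similarity that strongly favors "perfect" pushes the score toward 5.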
Taxonomy
Topics: Visual Attention and Saliency Detection · 3D Surveying and Cultural Heritage
Methods: Adapter
