PerPO: Perceptual Preference Optimization via Discriminative Rewarding
Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang

TL;DR
PerPO introduces a perception alignment method for multimodal large language models that improves visual discrimination, reduces reward hacking, and maintains generative abilities through discriminative rewarding and preference optimization.
Contribution
This work proposes PerPO, a novel perception alignment technique for MLLMs that combines discriminative rewarding with listwise preference optimization, enhancing visual discrimination and robustness.
Findings
Significantly improves visual discrimination in MLLMs
Reduces image-unconditional reward hacking
Maintains generative performance across tasks
Abstract
This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs). To align MLLMs with the human visual perception process, PerPO employs discriminative rewarding to gather diverse negative samples, followed by listwise preference optimization to rank them. By utilizing the reward as a quantitative margin for ranking, our method effectively bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly enhances MLLMs' visual discrimination capabilities while maintaining their generative strengths, mitigates image-unconditional reward hacking, and ensures consistent performance across visual tasks. This work marks a crucial step towards more perceptually aligned and versatile MLLMs. We also hope…
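The two stages named in the abstract, discriminative rewarding followed by ranking with the reward as a margin, can be illustrated with a minimal sketch. It assumes an object-grounding task in which the discriminative reward is the IoU between a sampled box and the ground-truth box, and it uses a reward-gap-weighted sum of pairwise DPO-style terms as a stand-in for the listwise objective. This is not the paper's exact loss. The names `iou`, `listwise_margin_loss`, `logratios`, and `beta` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union, used here as an assumed discriminative reward."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def listwise_margin_loss(
    rewards: Sequence[float],
    logratios: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-gap-weighted pairwise loss over a list of sampled responses.

    rewards[i]   : discriminative score of response i (e.g. IoU).
    logratios[i] : log pi_theta(y_i | x) - log pi_ref(y_i | x), assumed given.
    Every pair where response i scores higher than response j contributes a
    DPO-style term weighted by the reward gap, so larger gaps matter more.
    """
    total, weight_sum = 0.0, 0.0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            margin = rewards[i] - rewards[j]
            if margin <= 0:
                continue  # only pairs where i is strictly better than j
            total += -margin * log_sigmoid(beta * (logratios[i] - logratios[j]))
            weight_sum += margin
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    truth: Box = (0.0, 0.0, 2.0, 2.0)
    samples = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
    rewards = [iou(s, truth) for s in samples]
    # Illustrative log-ratios; a real run would take them from the two models.
    print(listwise_margin_loss(rewards, [1.0, 0.0, -1.0]))
```

In this sketch the loss falls when the policy's log-ratios are ordered the same way as the rewards and rises when the order is reversed. When all log-ratios are equal, every pair contributes the same per-unit-margin term and the normalized loss is log 2.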
Peer Reviews
Decision·Submitted to ICLR 2025
+ This paper addresses an important topic, the reward optimization issue in MLLMs, which the authors attribute to a misspecified reward definition and reward hacking in DPO. They propose corresponding measures to address the issue.
+ This paper contains many empirical results.
+ While this paper provides many empirical results (e.g., the authors constantly refer to Fig. 1), the authors explain their method only through intuition, whereas a more in-depth or theoretical analysis is expected for a paper like this. Empirical results alone are not convincing enough.
+ Since this paper simply uses the scalar task-specific discriminative score as the reward, why don't the authors compare their method to PPO while using the reward they defined? It seems equation (8) also seeks to
1. The writing is clear and smooth, providing a good explanation of the background and the proposed method.
2. Compared with DPO and SFT, PerPO enables more sample-efficient alignment for visual discriminative tasks.
3. PerPO also improves general image understanding and mitigates image-unconditional reward hacking.
1. The evaluation of general vision-language comprehension is based on LLaVA-Bench-in-the-Wild (LLaVA$^W$), a very small benchmark with fewer than 100 samples. The scores may not sufficiently reflect the MLLM's image understanding ability. Larger, widely adopted benchmarks such as VQAv2 and MM-Bench are preferred.
2. In Section 5.1, it is claimed that "discriminative reward also aligns well with human," but the results are evaluated by GPT-4o, not human users.
3. From Figure 2(b), the performance
The paper has several strengths. Firstly, it addresses an important challenge in the field of MLLMs, i.e., visual discrimination. Secondly, it presents a novel approach (PerPO) that bridges the gap between generative and discriminative functionalities of MLLMs, and does so effectively according to comprehensive empirical evaluations presented in the paper. Moreover, the PerPO framework reduces the reliance on human annotations for model training, which is a substantial contribution towards scala
One significant weakness is that the effectiveness of PerPO depends heavily on specific datasets, potentially limiting its generalization. The paper also acknowledges a limitation where complex tasks may still require human annotations, which might not always be feasible. Moreover, the paper may lack comprehensive experiments across a range of different domains, which would further validate its claim of general applicability.
Taxonomy
Topics: Color perception and design
Methods: ALIGN
