Learning to Align Human Code Preferences
Xin Yin, Chao Ni, Xiaohu Yang

TL;DR
This paper investigates training strategies for aligning large language models with human code preferences, proposing an adaptive method that improves performance across diverse scenarios.
Contribution
It introduces Adaptive Preference Optimization (APO), a novel dynamic training approach that combines theoretical insights and empirical validation for better code preference alignment.
Findings
APO matches or outperforms the existing SFT and S&D (SFT followed by DPO) strategies across the evaluated tasks.
Theoretical analysis supports the effectiveness of APO in different preference scenarios.
Extensive experiments validate APO's superior or comparable performance.
Abstract
Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that…
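For readers unfamiliar with the two objectives the abstract contrasts, the following is a minimal sketch of a sequence-level SFT loss and the standard DPO loss as commonly formulated in the DPO literature. It is not the paper's APO method, whose description is truncated above. The function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math


def sft_loss(logp_target: float) -> float:
    """SFT objective for one example: negative log-likelihood of the
    reference solution, given its summed token log-probability."""
    return -logp_target


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value; a hyperparameter in practice
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * (
        (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(sft_loss(-3.2))
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

SFT only pushes probability toward a single reference solution, whereas DPO needs a chosen/rejected pair and rewards widening the gap between them, which is consistent with the abstract's hypothesis that the two suit different preference scenarios.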
