HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan, Brennan Lagasse, Kevin Zhu

TL;DR
HiPO is a hierarchical preference optimization method that splits each response into reasoning segments and optimizes preferences for each segment separately, improving large language models' performance on complex reasoning tasks.
Contribution
HiPO combines the strengths of preference learning and structured reasoning: it splits each response into segments and applies a DPO loss to each one, which the authors present as a novel approach to LLM fine-tuning.
Findings
Models trained with HiPO outperform others on math benchmarks.
HiPO improves logical flow and consistency in generated responses.
Segment-specific training enhances reasoning capabilities.
Abstract
Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables…
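The weighted-sum loss described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the summed token log-probabilities for each segment are already available from the policy and a frozen reference model, and it uses the standard DPO loss for each segment. The segment names, the weights, the `beta` value, and the function names `dpo_loss` and `hierarchical_loss` are all hypothetical choices made for this example.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def hierarchical_loss(segment_logps, weights, beta=0.1):
    """Weighted sum of per-segment DPO losses.

    segment_logps maps a segment name to a tuple
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    weights maps the same segment names to non-negative floats (assumed values).
    """
    return sum(
        weights[name] * dpo_loss(*segment_logps[name], beta=beta)
        for name in weights
    )

# Hypothetical log-probabilities for the three segments named in the abstract.
example_logps = {
    "query": (-12.0, -12.5, -12.2, -12.4),
    "reasoning": (-80.0, -95.0, -85.0, -90.0),
    "answer": (-5.0, -9.0, -6.0, -7.0),
}
# Hypothetical weights; the source does not state the values used.
example_weights = {"query": 0.2, "reasoning": 0.5, "answer": 0.3}

total = hierarchical_loss(example_logps, example_weights, beta=0.1)
```

Because each segment contributes its own term, a response pair whose reasoning steps differ sharply but whose final answers look similar still produces a training signal on the reasoning segment, which is the granularity that whole-response DPO lacks.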
