HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Darsh Kachroo; Adriana Caraeni; Arjun Prasaath Anbazhagan; Brennan Lagasse; Kevin Zhu

arXiv:2604.20140·cs.AI·April 23, 2026

TL;DR

HiPO (Hierarchical Preference Optimization) extends DPO by splitting each response into reasoning segments (query clarification and context, reasoning steps, and answer) and optimizing a weighted sum of per-segment DPO losses, which improves performance on complex reasoning tasks.

Contribution

HiPO combines the strengths of stable preference learning and structured reasoning: it segments each response and applies the DPO loss to each segment rather than only to the response as a whole, a novel approach to LLM fine-tuning.

Findings

01. Models trained with HiPO outperform others on math benchmarks.

02. HiPO improves logical flow and consistency in generated responses.

03. Segment-specific training enhances reasoning capabilities.

Abstract

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables…
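
The weighted per-segment loss described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each segment's log-probability is the sum of its token log-probabilities under the policy and a frozen reference model, and the names `dpo_loss`, `hipo_loss`, `SEGMENTS`, the example weights, and `beta=0.1` are all hypothetical. The excerpt does not specify the weights or how segment boundaries are detected.

```python
import math

# Hypothetical segment labels, following the three segments named in the abstract.
SEGMENTS = ("query", "reasoning", "answer")


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a summed log-probability of a text span under the
    policy (pol_*) or the frozen reference model (ref_*).
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) is equal to softplus(-margin).
    return softplus(-margin)


def hipo_loss(segment_logps, weights, beta=0.1):
    """Weighted sum of per-segment DPO losses (assumed form of the objective).

    segment_logps maps each segment name to the four log-probabilities
    expected by dpo_loss; weights maps each segment name to a float.
    """
    return sum(
        weights[name] * dpo_loss(beta=beta, **segment_logps[name])
        for name in SEGMENTS
    )


if __name__ == "__main__":
    # Made-up log-probabilities for a single preference pair.
    pair = {
        "query": dict(pol_chosen=-4.0, pol_rejected=-4.5,
                      ref_chosen=-4.2, ref_rejected=-4.3),
        "reasoning": dict(pol_chosen=-20.0, pol_rejected=-26.0,
                          ref_chosen=-22.0, ref_rejected=-23.0),
        "answer": dict(pol_chosen=-1.0, pol_rejected=-3.0,
                       ref_chosen=-1.5, ref_rejected=-2.0),
    }
    # Illustrative weights only; the excerpt does not give the actual values.
    weights = {"query": 0.2, "reasoning": 0.5, "answer": 0.3}
    print(hipo_loss(pair, weights))
```

Because every segment contributes its own term, a pair whose chosen response has the better final answer but weaker intermediate reasoning still produces a distinct training signal for the reasoning segment, which whole-response DPO cannot provide.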
