SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Huashan Sun; Shengyi Liao; Yansen Han; Yu Bai; Yang Gao; Cheng Fu; Weizhou Shen; Fanqi Wan; Ming Yan; Ji Zhang; Fei Huang

arXiv:2505.11166·cs.CL·October 14, 2025

TL;DR

SoLoPO is a framework that improves long-context understanding in large language models by decoupling long-context preference optimization into short-context preference optimization and short-to-long reward alignment, leading to better generalization and training efficiency.

Contribution

It introduces a novel decoupled preference optimization method that enhances long-context capabilities and transferability in LLMs.

Findings

01 Improves length and domain generalization on long-context benchmarks.

02 Enhances computational and memory efficiency.

03 Compatible with mainstream preference optimization algorithms.

Abstract

Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages…
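The decoupling described in the abstract can be pictured with a small numerical sketch. This is an illustration under stated assumptions, not the paper's implementation: it assumes a DPO-style implicit reward (a scaled log-probability ratio against a reference model), an absolute-difference distance for the reward alignment term, the chosen-only variant mentioned in the reviews, and a hypothetical weight `alpha`. All function names are made up for this example.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy, logp_ref, beta):
    # ASSUMPTION: DPO-style implicit reward, beta * log(pi / pi_ref).
    return beta * (logp_policy - logp_ref)


def short_context_po_loss(r_short_chosen, r_short_rejected):
    # Preference loss computed only on the short context.
    return -log_sigmoid(r_short_chosen - r_short_rejected)


def solo_ra_chosen_only(r_short_chosen, r_long_chosen):
    # ASSUMPTION: alignment measured as the absolute reward gap of the
    # chosen response under the short and the long context.
    return abs(r_short_chosen - r_long_chosen)


def solopo_loss(r_short_chosen, r_short_rejected, r_long_chosen, alpha):
    # Short-context PO plus a weighted short-to-long reward alignment term.
    po = short_context_po_loss(r_short_chosen, r_short_rejected)
    ra = solo_ra_chosen_only(r_short_chosen, r_long_chosen)
    return po + alpha * ra
```

In this sketch only the chosen response is scored under the long context, which is one way to read the efficiency claim: the rejected response never needs a long-context forward pass.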

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The proposed decoupling framework (short-context PO + reward alignment) is elegant, theoretically grounded, and easy to integrate into existing RLHF/PO pipelines. 2. The theoretical formulation clearly explains how SoLoPO approximates the long-context objective through an upper bound, providing a solid foundation for the method. 3. The “chosen-only” SoLo-RA variant is an insightful practical contribution that reduces instability and significantly cuts training cost while maintaining effectiven

Weaknesses

1. The paper could provide more qualitative analysis or visualization to show how SoLoPO improves long-context reasoning (e.g., attention heatmaps or retrieved key information patterns). 2. The framework assumes that short contexts can fully preserve essential information; performance may degrade if the summarization or compression is imperfect, which is not deeply discussed.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. **Novel framework.** Proposes the SoLoPO framework to transfer short-context preference optimization capabilities to long-context alignment. 2. **Theoretical foundation.** The method is supported by solid theoretical results that justify the proposed decoupling. 3. **General applicability.** The framework can be integrated with multiple preference alignment methods, showing consistent improvements across them. 4. **Strong long-context performance.** The chosen-only SoLoPO variant consist

Weaknesses

1. **Degraded short-context performance.** The method shows reduced performance on the short-context Open LLM Leaderboard. Lines 103 and 377 claim that SoLoPO maintains short-context performance, yet Table 4 indicates otherwise. SoLoPO underperforms the PO baseline in 16 out of 24 datasets. 2. **Limited intuition for theoretical results.** Although the theory is sound, the paper should do a better job providing intuition on *why* the decoupling leads to improved long-context performance (see Q

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. A theoretical decomposition of long-context PO into short-context PO and SoLo-RA. 2. The idea of decoupling long-context alignment into short-context reasoning and cross-context reward alignment is novel and well-motivated. 3. Offers a practical and scalable solution with clear efficiency gains (e.g., 2.1× longer trainable sequences, 52% runtime reduction).

Weaknesses

1. The theory relies on the redundancy hypothesis and Assumption 1, which, while empirically supported, may not hold for all long-context tasks (e.g., when all context is relevant). 2. The synthetic dataset construction (mixing relevant and irrelevant documents) is simple but may not reflect real-world long-context complexity.
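The synthetic construction that Reviewer 03 refers to (mixing relevant and irrelevant documents) might look roughly like the following. This is a minimal sketch under assumptions: the function name, the use of a seeded shuffle, and the choice to treat the relevant documents alone as the short context are all illustrative and are not taken from the paper.

```python
import random


def build_contexts(relevant_docs, distractor_docs, num_distractors, seed=0):
    # ASSUMPTION: the short context is just the relevant documents.
    rng = random.Random(seed)
    short_context = list(relevant_docs)

    # ASSUMPTION: the long context adds sampled irrelevant documents and
    # shuffles everything so relevant evidence is not always at the start.
    sampled = rng.sample(distractor_docs, num_distractors)
    long_context = short_context + sampled
    rng.shuffle(long_context)
    return short_context, long_context


short_ctx, long_ctx = build_contexts(
    relevant_docs=["doc about the question"],
    distractor_docs=["noise A", "noise B", "noise C"],
    num_distractors=2,
)
```

A setup like this also makes the reviewer's concern concrete: if every document were relevant, the short context would be as long as the long one and nothing would be gained.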

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Parrot optimizer: Algorithm and applications to medical problems