Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

Xin Guan; Zijian Li; Shen Huang; Pengjun Xie; Jingren Zhou; Jiuxin Cao

arXiv:2601.10306 · cs.AI · April 21, 2026

TL;DR

EAPO pairs a dense, group-relative evidence reward with reward-policy co-evolution to improve evidence retrieval and reasoning when RL is applied to long-context tasks, outperforming SOTA baselines on eight benchmarks.

Contribution

The paper proposes a novel Evidence-Augmented Policy Optimization framework with reward co-evolution, addressing reward sparsity and evidence extraction challenges in long-context reasoning.

Findings

01 EAPO outperforms SOTA baselines on eight benchmarks.

02 Reward model refinement improves evidence quality during training.

03 Adaptive co-evolution enhances long-context reasoning accuracy.

Abstract

While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by the sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution…
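The abstract does not spell out how the Group-Relative Evidence Reward is computed, so the following is only a minimal sketch of one plausible reading: each rollout's evidence score is normalized against the other rollouts sampled for the same prompt, in the style of group-relative advantage estimation. The function names, the `evidence_weight` parameter, and the additive combination with the outcome reward are all assumptions made for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def group_relative_rewards(scores, eps=1e-8):
    """Normalize scores against the mean and std of their own sampling group.

    Hypothetical helper: `scores` holds one scalar per rollout, all sampled
    for the same prompt. If every score is identical, all outputs are 0.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population std over the group
    return [(s - mu) / (sigma + eps) for s in scores]


def combined_rewards(evidence_scores, outcome_rewards, evidence_weight=0.5):
    """Add a dense evidence signal to the sparse outcome signal.

    Assumption: a simple weighted sum. The paper may combine them differently.
    """
    evidence = group_relative_rewards(evidence_scores)
    outcome = group_relative_rewards(outcome_rewards)
    return [o + evidence_weight * e for o, e in zip(outcome, evidence)]


# Four rollouts for one prompt. Rollouts 0 and 1 both answer correctly,
# but rollout 1 does so with poorly grounded evidence (a "lucky guess").
evidence_scores = [0.9, 0.2, 0.6, 0.1]  # hypothetical reward-model scores
outcome_rewards = [1.0, 1.0, 0.0, 0.0]  # sparse correctness signal
rewards = combined_rewards(evidence_scores, outcome_rewards)
```

With outcome rewards alone, rollouts 0 and 1 would be indistinguishable. Under this sketch the evidence term separates them, which is the kind of process supervision the abstract describes.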
