PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

Junkeun Yi; Damon Mosk-Aoyama; Baihe Huang; Ritu Gala; Charles Wang; Sugam Dipak Devare; Khushi Bhardwaj; Abhibha Gupta; Oleksii Kuchaiev; Jiantao Jiao; Jian Zhang; Venkat Srinivasan

arXiv:2603.21383 · cs.AI · March 24, 2026

Open Access

TL;DR

PivotRL is a post-training framework that combines the compute efficiency of supervised fine-tuning (SFT) with the out-of-domain robustness of end-to-end reinforcement learning (E2E RL), improving accuracy over SFT while using fewer rollout turns than E2E RL.

Contribution

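The paper introduces PivotRL, a method that runs local on-policy rollouts from existing SFT trajectories, focuses on pivots (intermediate turns where sampled actions show high variance in outcomes), and rewards functionally equivalent actions, improving accuracy while reducing compute cost.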

Findings

01. +4.17% in-domain accuracy over SFT

02. +10.04% out-of-domain accuracy over SFT

03. Achieves similar accuracy to E2E RL with 4x fewer rollout turns

Abstract

Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots, that is, informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration data. We theoretically show that these…
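The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `functional_reward` and `select_pivots`, the JSON tool-call format, the 0/1 reward, and the variance threshold `min_variance` are all assumptions made for illustration. The sketch only shows the general idea of scoring sampled actions by functional equivalence instead of string equality, and keeping turns whose sampled rewards vary.

```python
import json
from statistics import pvariance


def functional_reward(sampled_action: str, reference_action: str) -> float:
    """Hypothetical reward: 1.0 if both actions parse to the same JSON value, else 0.0.

    Comparing parsed objects ignores key order and whitespace, so two
    differently formatted but equivalent tool calls receive the same reward.
    """
    try:
        return float(json.loads(sampled_action) == json.loads(reference_action))
    except json.JSONDecodeError:
        # An action that cannot be parsed is treated as incorrect.
        return 0.0


def select_pivots(rewards_per_turn: dict[int, list[float]],
                  min_variance: float = 0.05) -> list[int]:
    """Hypothetical filter: keep turns whose sampled rewards have variance above a threshold.

    Turns where every sample succeeds or every sample fails have zero
    variance and are dropped.
    """
    return [
        turn
        for turn, rewards in sorted(rewards_per_turn.items())
        if len(rewards) > 1 and pvariance(rewards) > min_variance
    ]


# Illustrative data only: one reference action and three sampled actions for turn 1.
reference = '{"tool": "search", "args": {"q": "weather", "k": 3}}'
samples = [
    '{"args": {"k": 3, "q": "weather"}, "tool": "search"}',  # same call, different key order
    '{"tool": "search", "args": {"q": "news", "k": 3}}',     # different query
    'not json',                                              # malformed action
]
turn_1_rewards = [functional_reward(s, reference) for s in samples]

rewards_per_turn = {
    0: [1.0, 1.0, 1.0],   # always solved: no variance
    1: turn_1_rewards,    # mixed outcomes
    2: [0.0, 0.0, 0.0],   # never solved: no variance
}
print(select_pivots(rewards_per_turn))  # [1]
```

In this toy setting a strict string match would score the first sample as wrong even though it encodes the same tool call, which is the failure mode the abstract contrasts against. How the paper actually defines functional equivalence and outcome variance is not stated in the excerpt above.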

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications