Sample-efficient LLM Optimization with Reset Replay
Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian

TL;DR
LoRR is a novel plugin that enhances sample efficiency and mitigates overfitting in preference-based LLM optimization through reset replay and a hybrid objective, improving performance on reasoning benchmarks.
Contribution
Introduces LoRR, a plugin with reset replay and hybrid optimization to boost sample efficiency and reduce overfitting in preference-based LLM training.
Findings
LoRR significantly improves performance on reasoning benchmarks.
Iterative DPO augmented with LoRR rivals more complex baselines.
LoRR enables effective learning from limited offline data.
Abstract
Recent advancements in LLM post-training, particularly through reinforcement learning and preference optimization, are key to boosting their reasoning capabilities. However, these methods often suffer from low sample efficiency and a susceptibility to primacy bias, a phenomenon where overfitting to initial experiences diminishes network plasticity and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin for enhancing sample efficiency in preference-based optimization. Its core mechanism enables high-replay training to maximize the utility of each data batch. To mitigate overfitting, LoRR orchestrates a periodic reset strategy that reuses the initial data and policy to maintain network plasticity, and further adopts a hybrid optimization objective to better exploit training data. Extensive…
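The three ingredients named in the abstract (high-replay training, periodic resets, and a hybrid objective) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `replay_ratio`, `reset_interval`, and `alpha` are hypothetical, and the hybrid objective is assumed here to be the standard DPO loss plus a weighted negative log-likelihood term on the chosen response, which the abstract does not confirm. What a reset actually restores is left to the caller.

```python
import math
from typing import List, Tuple


def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, given sequence log-probs.

    pi_* are policy log-probs, ref_* are reference-model log-probs,
    for the chosen (w) and rejected (l) responses.
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def hybrid_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
                beta: float = 0.1, alpha: float = 1.0) -> float:
    """ASSUMPTION: hybrid objective = DPO loss + alpha * NLL of the chosen response."""
    sft_nll = -pi_w  # negative log-likelihood of the chosen response
    return dpo_loss(pi_w, pi_l, ref_w, ref_l, beta) + alpha * sft_nll


def reset_replay_schedule(num_batches: int, replay_ratio: int,
                          reset_interval: int) -> List[Tuple[str, int]]:
    """Return an ordered list of ("update", batch) and ("reset", batch) events.

    Each batch is replayed `replay_ratio` times (high-replay training).
    A "reset" event precedes every `reset_interval`-th batch (hypothetical
    schedule); the caller decides what a reset restores, e.g. the initial policy.
    """
    events: List[Tuple[str, int]] = []
    for b in range(num_batches):
        if b > 0 and b % reset_interval == 0:
            events.append(("reset", b))
        events.extend(("update", b) for _ in range(replay_ratio))
    return events
```

With `reset_replay_schedule(3, 2, 2)`, each of the three batches is used for two updates, and a reset marker is placed before the third batch. A real training loop would compute `hybrid_loss` on model log-probabilities at every "update" event.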
