AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

TL;DR
AMARIS introduces a persistent evaluation memory that grounds rubric-based reward shaping in reinforcement learning in long-term training history, enabling more strategic and effective rubric adjustments.
Contribution
It proposes a novel long-term, memory-augmented framework for rubric-based RL. The framework improves evaluation diagnostics and enables more strategic rubric updates than existing methods, which rely only on short-term signals.
Findings
AMARIS outperforms baseline methods across multiple domains.
Static and dynamic memory retrieval both contribute to performance gains.
The system adds only about 5% overhead, indicating that the memory mechanism is lightweight.
Abstract
Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals, such as the rollouts from the current step or pairwise comparisons. However, these methods discard evaluation diagnostics after their immediate use, preventing the long-term accumulation and strategic reuse of evaluation knowledge. This forces the system to re-derive evaluation principles from scratch, limits its ability to detect recurring suboptimal behaviors, and forfeits the curriculum-like progression that a persistent training history would naturally support. To address these limitations, we introduce AMARIS, which grounds rubric modifications in long-term training history. At each training step, AMARIS…
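Rubric-based reward shaping, as the abstract describes it, scores a response along several rubric dimensions and then combines those scores into one reward. The minimal sketch below shows only that generic aggregation step and assumes a normalized weighted sum. The `Criterion` class, the `rubric_reward` function, and the weights are hypothetical and do not come from AMARIS. The paper's own aggregation rule and its memory-driven rubric updates are not shown.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric dimension (hypothetical structure, for illustration only)."""
    name: str
    weight: float  # assumed relative importance of this dimension


def rubric_reward(scores: dict[str, float], rubric: list[Criterion]) -> float:
    """Combine per-criterion scores in [0, 1] into a scalar reward.

    Assumes a normalized weighted sum. A criterion missing from `scores`
    contributes 0.
    """
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(c.weight * scores.get(c.name, 0.0) for c in rubric)
    return weighted / total_weight


if __name__ == "__main__":
    rubric = [Criterion("correctness", 2.0), Criterion("clarity", 1.0)]
    # (2.0 * 1.0 + 1.0 * 0.5) / 3.0
    print(rubric_reward({"correctness": 1.0, "clarity": 0.5}, rubric))
```

A single scalar keeps the sketch compatible with standard policy-gradient updates. Methods that adapt the rubric, whether from local signals or from a persistent memory as in AMARIS, would change the criteria or weights between training steps rather than this aggregation function.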
