MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization
Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu

TL;DR
MAESTRO introduces a meta-learning approach that dynamically balances conflicting objectives in reward optimization for large language models, improving performance across diverse benchmarks.
Contribution
It proposes a novel meta-cognitive layer that treats reward scalarization as a latent policy, enabling adaptive trade-offs in open-domain LLM tasks.
Findings
Outperforms static reward baselines on seven benchmarks.
Reduces redundant generation in some settings.
Maintains efficiency advantages of GRPO.
Abstract
Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives (such as creativity versus factuality) for which rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization…
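As a rough illustration of the idea in the abstract, the sketch below maps a terminal hidden state to a weight vector over reward objectives, scalarizes a group of multi-objective rewards with those weights, and then applies the standard GRPO group-relative normalization. This is not the authors' implementation: the linear softmax head, the names `scalarization_weights`, `W`, `b`, the dimensions, and the random inputs are all assumptions made for illustration, and the bi-level meta-learning update that would train the head is not shown.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def scalarization_weights(hidden_state, W, b):
    """Hypothetical linear head: hidden state -> weights on the probability simplex."""
    return softmax(W @ hidden_state + b)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize scalar rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative sizes and random data (assumptions, not values from the paper).
rng = np.random.default_rng(0)
hidden_dim, num_objectives, group_size = 8, 3, 4

W = rng.normal(size=(num_objectives, hidden_dim))  # assumed head parameters
b = np.zeros(num_objectives)
h = rng.normal(size=hidden_dim)                    # stand-in for a terminal hidden state

# One row per sampled response, one column per objective (e.g. creativity, factuality).
reward_matrix = rng.uniform(size=(group_size, num_objectives))

weights = scalarization_weights(h, W, b)           # prompt-dependent trade-off
scalar_rewards = reward_matrix @ weights           # weighted sum per response
advantages = group_relative_advantages(scalar_rewards)
```

A static baseline corresponds to fixing `weights` to the same vector for every prompt; the adaptive variant lets the weights change with the hidden state, so different prompts can prioritize different objectives.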
