MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Yang Zhao; Hepeng Wang; Xiao Ding; Yangou Ouyang; Bibo Cai; Kai Xiong; Jinglong Gao; Zhouhao Sun; Li Du; Bing Qin; Ting Liu

arXiv:2601.07208·cs.LG·April 14, 2026

TL;DR

MAESTRO introduces a meta-learning approach that dynamically balances conflicting objectives in reward optimization for large language models, improving performance across diverse benchmarks.

Contribution

MAESTRO adds a meta-cognitive layer that treats reward scalarization as a latent policy, enabling adaptive trade-offs between objectives in open-domain LLM tasks.

Findings

01. Outperforms static reward baselines on seven benchmarks.

02. Reduces redundant generation in some settings.

03. Maintains efficiency advantages of GRPO.

Abstract

Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives, such as creativity versus factuality, where rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization…
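To make the idea of context-dependent scalarization concrete, here is a minimal sketch in Python. It is not the authors' implementation. The abstract does not say how the weights are computed, so the linear head followed by a softmax, the function names, and the toy numbers are all assumptions chosen for illustration. The advantage step uses the commonly described group-relative normalization (reward minus group mean, divided by group standard deviation). The bi-level meta-learning update is not shown.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scalarization_weights(hidden_state, head):
    # Hypothetical weight head: one row of `head` per reward dimension.
    # Maps a context vector (standing in for a terminal hidden state)
    # to non-negative weights that sum to 1.
    logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in head]
    return softmax(logits)

def scalarize(reward_vectors, weights):
    # Collapse each response's multi-objective reward into one scalar.
    return [sum(w * r for w, r in zip(weights, rv)) for rv in reward_vectors]

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize scalar rewards within one group of sampled responses.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

# Toy example: two objectives (say, factuality and creativity), three responses.
hidden_state = [1.0, 0.0]           # assumed context features for this prompt
head = [[2.0, 0.0], [0.0, 2.0]]     # assumed weight-head parameters
reward_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = scalarization_weights(hidden_state, head)
scalar_rewards = scalarize(reward_vectors, weights)
advantages = group_relative_advantages(scalar_rewards)
```

In this toy setup the context favors the first objective, so the response that scores highest on it receives the largest advantage. A static scalarization would use the same fixed weights for every prompt.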
