ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
Charlie Masters, Marta Grześkiewicz, Stefano V. Albrecht

TL;DR
ARCANE introduces a multi-agent framework that uses natural-language rubrics for interpretable, adaptable reward modeling, enabling better alignment of large language models with stakeholder preferences in complex tasks.
Contribution
The paper presents a multi-agent approach that dynamically generates interpretable natural-language rubrics for reward modeling, so that preference shifts can be incorporated at interaction time without retraining.
Findings
Rubrics are compact and legible, aiding interpretability.
Configurable trade-offs are achievable without retraining.
Rubric-based reward models improve alignment in complex tasks.
Abstract
As agents based on large language models are increasingly deployed to long-horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining. We introduce ARCANE, a framework that frames alignment as a multi-agent collaboration problem that dynamically represents stakeholder preferences as natural-language rubrics: weighted sets of verifiable criteria that can be generated on-the-fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and apply a regularized Group-Sequence Policy Optimization (GSPO) procedure that balances…
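To make the phrase "weighted sets of verifiable criteria" concrete, here is a minimal Python sketch of how such a rubric could be represented and scored. It is an illustration under stated assumptions, not the paper's implementation: the names `Criterion` and `rubric_score`, the keyword-based checks, and the normalized weighted-sum scoring rule are all hypothetical, and the abstract does not specify how ARCANE verifies criteria or aggregates weights.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One verifiable rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # returns True when the output satisfies the item


def rubric_score(rubric: List[Criterion], output: str) -> float:
    """Assumed scoring rule: weight of satisfied criteria divided by total weight."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


# Toy rubric with simple string checks standing in for real verifiers.
rubric = [
    Criterion("Mentions a deadline", 2.0, lambda t: "deadline" in t.lower()),
    Criterion("Stays under 20 words", 1.0, lambda t: len(t.split()) <= 20),
    Criterion("Includes a cost estimate", 1.0, lambda t: "$" in t),
]

draft = "The deadline is Friday."
base_score = rubric_score(rubric, draft)

# A stakeholder who cares more about cost changes one weight; nothing is retrained.
reweighted = [
    Criterion(c.description, 5.0 if "cost" in c.description else c.weight, c.check)
    for c in rubric
]
shifted_score = rubric_score(reweighted, draft)

print(base_score, shifted_score)
```

In this toy setup the same draft scores lower once the cost criterion is given more weight, which is the kind of configurable trade-off the summary describes: the preference shift is expressed by editing the rubric rather than by updating model parameters.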
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Multimodal Machine Learning Applications
