ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment

Charlie Masters; Marta Grześkiewicz; Stefano V. Albrecht

arXiv:2512.06196 · cs.AI · December 9, 2025

TL;DR

ARCANE introduces a multi-agent framework that uses natural-language rubrics for interpretable, adaptable reward modeling, enabling better alignment of large language models with stakeholder preferences in complex tasks.

Contribution

The paper presents a novel multi-agent approach that dynamically generates interpretable rubrics for reward modeling, allowing real-time preference shifts without retraining.

Findings

01. Rubrics are compact and legible, aiding interpretability.
02. Configurable trade-offs are achievable without retraining.
03. Rubric-based reward models improve alignment in complex tasks.
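
To make the second finding concrete, here is a minimal sketch of a rubric treated as a weighted set of verifiable criteria, which is how the abstract describes it. It is not ARCANE's implementation: the `Criterion` class, the `score` and `reweight` helpers, the normalized weighted-sum scoring rule, and the two toy checks are all assumptions chosen for illustration. The point is only that changing the weights changes the reward on the same output, with no model retraining involved.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One verifiable rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # returns True when the output satisfies the item


def score(rubric: Sequence[Criterion], output: str) -> float:
    """Assumed scoring rule: fraction of total weight earned by satisfied criteria."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def reweight(rubric: Sequence[Criterion], new_weights: dict[str, float]) -> list[Criterion]:
    """Return a copy of the rubric with some weights replaced; checks are untouched."""
    return [
        Criterion(c.description, new_weights.get(c.description, c.weight), c.check)
        for c in rubric
    ]


# Toy rubric: the stakeholder initially cares more about brevity than sourcing.
rubric = [
    Criterion("cites a source", 1.0, lambda s: "http" in s),
    Criterion("under 50 words", 3.0, lambda s: len(s.split()) < 50),
]
draft = "Short answer with no link."

print(score(rubric, draft))   # 0.75

# Preference shift at interaction time: sourcing now matters more.
shifted = reweight(rubric, {"cites a source": 3.0, "under 50 words": 1.0})
print(score(shifted, draft))  # 0.25
```

Because each criterion carries a plain-language description and an explicit weight, a stakeholder can read off why the same draft drops from 0.75 to 0.25 after the shift.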

Abstract

As agents based on large language models are increasingly deployed to long-horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining. We introduce ARCANE, a framework that frames alignment as a multi-agent collaboration problem that dynamically represents stakeholder preferences as natural-language rubrics: weighted sets of verifiable criteria that can be generated on-the-fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and apply a regularized Group-Sequence Policy Optimization (GSPO) procedure that balances…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Multimodal Machine Learning Applications