TL;DR
This paper introduces a two-stage fine-tuning method that internalizes multi-agent debate in a single large language model, using up to 93% fewer tokens while matching or exceeding the reasoning performance of explicit debate.
Contribution
It presents a framework for distilling multi-agent debate into a single LLM, which makes reasoning cheaper at inference time and lets the internalized agent perspectives be analyzed and controlled through activation steering.
Findings
Internalized models match or exceed explicit multi-agent debate while using up to 93% fewer tokens.
Activation steering reveals agent-specific subspaces: interpretable directions in activation space that correspond to different agent perspectives.
Distillation facilitates easier localization and control of harmful behaviors.
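To make the idea of an agent-specific direction concrete, here is a minimal sketch of one common form of activation steering: take the difference between mean activations collected under two conditions, normalize it, and add a scaled copy to a hidden state. This is an illustrative assumption, not the paper's procedure; the function names (`steering_direction`, `steer`, `projection`) and the toy two-dimensional activations are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def steering_direction(acts_a: Sequence[Sequence[float]],
                       acts_b: Sequence[Sequence[float]]) -> Vector:
    """Unit-norm difference of mean activations (condition A minus condition B).

    Assumption: acts_a and acts_b are hidden states from one layer, collected
    while the model behaves like two different "agents".
    """
    mu_a = mean_vector(acts_a)
    mu_b = mean_vector(acts_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two conditions have identical mean activations")
    return [d / norm for d in diff]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar component of a hidden state along a unit direction."""
    return sum(h * d for h, d in zip(hidden, direction))


if __name__ == "__main__":
    # Toy activations, invented for illustration only.
    direction = steering_direction([[1.0, 0.0], [3.0, 0.0]],
                                   [[0.0, 0.0], [0.0, 0.0]])
    steered = steer([0.5, 0.5], direction, alpha=2.0)
    print(direction, steered, projection(steered, direction))
```

In a real model the same arithmetic would be applied to a layer's residual-stream activations, typically through a forward hook; which layer and which steering strength work best is an empirical question that this sketch does not address.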
Abstract
Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi-agent debate into a single LLM through a two-stage fine-tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi-agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent-specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling…
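The abstract names dynamic reward scheduling and length clipping but does not say how either is defined. Purely as an illustration of how such pieces could fit together, the sketch below ramps up a length penalty over training and truncates responses at a hard cap. The linear schedule, the penalty formula, and every name and parameter here (`length_penalty_weight`, `scheduled_reward`, `budget`, `max_weight`) are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def length_penalty_weight(step: int, total_steps: int, max_weight: float = 1.0) -> float:
    """Hypothetical schedule: ramp the length-penalty weight linearly from 0 to max_weight."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * frac


def clip_tokens(tokens: Sequence[int], max_len: int) -> List[int]:
    """Hypothetical length clipping: keep at most max_len tokens of a response."""
    return list(tokens[:max_len])


def scheduled_reward(correct: bool, num_tokens: int, step: int,
                     total_steps: int, budget: int, max_weight: float = 1.0) -> float:
    """Correctness reward minus a length penalty that grows as training proceeds.

    Assumption: the penalty is the fraction by which the response exceeds the
    token budget, capped at 1 and scaled by the scheduled weight.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    weight = length_penalty_weight(step, total_steps, max_weight)
    overflow = max(0, num_tokens - budget) / budget
    return (1.0 if correct else 0.0) - weight * min(overflow, 1.0)


if __name__ == "__main__":
    # Early in training a long but correct answer is not penalized;
    # later, the same length costs part of the reward.
    print(scheduled_reward(True, 150, step=0, total_steps=100, budget=100))
    print(scheduled_reward(True, 150, step=50, total_steps=100, budget=100))
    print(clip_tokens(list(range(10)), max_len=4))
```

The intuition behind a schedule of this kind is that the model can first learn to produce the debate-style reasoning at full length, and is only gradually pushed to compress it once answers are reliable.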
