Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

Chenxi Liu; Yanshuo Chen; Ruibo Chen; Tianyi Xiong; Tong Zheng; Heng Huang

arXiv:2601.22297·cs.CL·May 19, 2026

TL;DR

This paper introduces Self-Debate Reinforcement Learning (SDRL), a training framework that enhances large language models by enabling them to learn from self-debate, improving both standalone reasoning and multi-agent debate performance.

Contribution

SDRL is a novel training method that prepares models for multi-agent debate by jointly optimizing for standalone and debate-conditioned reasoning capabilities.

Findings

1. SDRL improves multi-agent debate performance across various protocols.

2. SDRL enhances single-model reasoning abilities.

3. Experiments show consistent gains across multiple models and benchmarks.

Abstract

The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework where models learn from self-debate, equipping a single LLM with both strong standalone problem-solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates…
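
The first two steps named in the abstract (sampling several candidate solutions, then assembling a debate context from diverse reasoning paths) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract is truncated on this page, so the sketch stops at context construction and says nothing about what SDRL generates or optimizes afterwards. Every name here (`sample_candidates`, `select_diverse`, `build_debate_context`, `toy_sampler`), the prompt template, and the "distinct final answers" diversity heuristic are assumptions made for illustration only.

```python
import random
from typing import Callable, List, Tuple

# A candidate is (reasoning text, final answer). Hypothetical representation.
Candidate = Tuple[str, str]
Sampler = Callable[[str, random.Random], Candidate]


def sample_candidates(prompt: str, sampler: Sampler, n: int, seed: int = 0) -> List[Candidate]:
    """Draw n candidate solutions for one prompt from a (stand-in) policy."""
    rng = random.Random(seed)
    return [sampler(prompt, rng) for _ in range(n)]


def select_diverse(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Keep at most k candidates with pairwise distinct final answers.

    Assumed heuristic: the source only says the context contains
    "diverse reasoning paths" and does not specify how they are chosen.
    """
    seen = set()
    selected = []
    for reasoning, answer in candidates:
        if answer not in seen:
            seen.add(answer)
            selected.append((reasoning, answer))
        if len(selected) == k:
            break
    return selected


def build_debate_context(prompt: str, selected: List[Candidate]) -> str:
    """Format the selected solutions into one debate prompt (hypothetical template)."""
    lines = [f"Problem: {prompt}", "", "Solutions proposed by other agents:"]
    for i, (reasoning, answer) in enumerate(selected, start=1):
        lines.append(f"Agent {i}: {reasoning} Final answer: {answer}")
    lines.append("")
    lines.append("Considering these solutions, give your own reasoning and final answer.")
    return "\n".join(lines)


def toy_sampler(prompt: str, rng: random.Random) -> Candidate:
    """Stand-in for an LLM call: returns one of two canned answers at random."""
    answer = rng.choice(["4", "5"])
    return (f"I computed the result as {answer}.", answer)


if __name__ == "__main__":
    question = "What is 2 + 2?"
    candidates = sample_candidates(question, toy_sampler, n=8)
    context = build_debate_context(question, select_diverse(candidates, k=2))
    print(context)
```

In a real setup, `toy_sampler` would be replaced by a call to the model being trained, and the selection rule would follow whatever the full paper specifies.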

Code & Models

Models

🤗 valendra/qwen3.5-4b-demon-angel
model · 74 dl · ♡ 1

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Hate Speech and Cyberbullying Detection