TL;DR
This paper introduces InspireScore, a multi-dimensional evaluation system for debates, and InspireDebate, a debating framework optimized through phased reasoning improvements; together they yield substantial gains in assessment accuracy and debate quality.
Contribution
It presents a novel multi-dimensional evaluation system and an optimized debating framework that jointly enhance debate quality and assessment accuracy.
Findings
InspireScore achieves 44% higher correlation with expert judgments.
InspireDebate outperforms baselines by 57%.
The framework improves debate quality across multiple criteria.
Abstract
With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions, including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement, thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance)…