DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models
Utkarsh Tiwari, Aryan Seth, Adi Mukherjee, Kaavya Mer, Kavish, Dhruv Kumar

TL;DR
DebateBench is a large-scale dataset of lengthy debate transcripts designed to evaluate large language models' reasoning, argumentation, and deliberation skills in long-context scenarios, revealing current models' limitations.
Contribution
The paper introduces DebateBench, a challenging long-context reasoning benchmark with detailed annotations, to assess and improve LLMs' debate and argumentation capabilities.
Findings
LLMs struggle with long-context reasoning on DebateBench.
GPT-4 and other models show limited performance on complex debate tasks.
DebateBench highlights the need for advanced techniques to enhance LLM reasoning.
Abstract
We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates. The dataset consists of British Parliamentary debates from prestigious debating tournaments on diverse topics, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates; each debate is over an hour long, and each input averages 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation, deliberation, and alignment with human experts. To do well on DebateBench, LLMs must perform in-context learning to understand the rules and evaluation criteria of the debates, then…
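The totals in the abstract imply 256 / 32 = 8 speeches per debate, which matches the British Parliamentary format of four two-speaker teams. As a rough illustration only, one way such a record might be laid out is sketched below. The class and field names (`Speech`, `Debate`, `house_ranking`, `is_complete`) are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Derived from the abstract's totals: 256 speeches across 32 debates.
SPEECHES_PER_DEBATE = 256 // 32  # 8

@dataclass
class Speech:
    team: str        # bench name, e.g. "Opening Government"
    transcript: str  # full text of the speech
    score: float     # speech-level score from adjudication data

@dataclass
class Debate:
    motion: str
    speeches: List[Speech]
    house_ranking: List[str]  # team names ordered from first to last place

def is_complete(debate: Debate) -> bool:
    """Check the speech count and that every team is ranked exactly once."""
    teams = {s.team for s in debate.speeches}
    return (
        len(debate.speeches) == SPEECHES_PER_DEBATE
        and sorted(debate.house_ranking) == sorted(teams)
    )
```

A check like `is_complete` is only a sanity test on an assumed layout; the released data may organise scores and rankings differently.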
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Adam · Softmax · Dropout · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning
