A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2

Marcin Pietroń; Filip Gampel; Jakub Gomułka; Andrzej Tomski; Rafał Olszowski

arXiv:2603.19253 · cs.CL · March 23, 2026

Open Access

TL;DR

This paper evaluates state-of-the-art large language models like GPT-5.2, Llama, and DeepSeek on argument classification tasks, demonstrating how advanced prompting strategies improve performance and revealing systematic challenges in argument understanding.

Contribution

It provides the first comprehensive benchmarking and qualitative analysis of LLMs on argument classification with advanced prompts across multiple datasets.

Findings

01. GPT-5.2 achieves up to 78% accuracy on UKP and 92% on Args.me.
02. Prompt rephrasing and voting improve accuracy by 2–8%.
03. Models share failure modes like prompt sensitivity and difficulty with implicit criticism.
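The rephrasing-and-voting finding above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the three prompt paraphrases, the UKP-style label strings (Argument_for, Argument_against, NoArgument), the hypothetical `vote` helper, and the `fake_llm` stub standing in for a real model call are all invented for the example. Each paraphrase is sent to the model once, and the most frequent label wins.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable

# Hypothetical paraphrases of one classification prompt (not the paper's wording).
# Labels assume a UKP-style three-way scheme: Argument_for / Argument_against / NoArgument.
PROMPT_VARIANTS = [
    "Classify the sentence with respect to the topic '{topic}' as "
    "Argument_for, Argument_against or NoArgument.\nSentence: {text}",
    "Does this sentence support, oppose, or not argue about '{topic}'? "
    "Answer Argument_for, Argument_against or NoArgument.\nSentence: {text}",
    "Topic: {topic}\nSentence: {text}\n"
    "Label (Argument_for / Argument_against / NoArgument):",
]


def vote(classify: Callable[[str], str], topic: str, text: str) -> str:
    """Query the model once per paraphrase and return the majority label.

    `classify` is a stand-in for an LLM call that returns a label string.
    Counter.most_common orders equal counts by first occurrence, so a tie
    goes to the label produced by the earliest prompt variant.
    """
    predictions = [classify(p.format(topic=topic, text=text)) for p in PROMPT_VARIANTS]
    label, _count = Counter(predictions).most_common(1)[0]
    return label


def fake_llm(prompt: str) -> str:
    # Canned stand-in for a real model: disagrees only on the third variant.
    return "NoArgument" if prompt.startswith("Topic:") else "Argument_against"


print(vote(fake_llm, "nuclear energy", "Reactors leave waste that stays radioactive for millennia."))
# → Argument_against (2 of 3 variants agree)
```

With three labels a three-way split is still possible, which is why the sketch fixes an explicit tie rule; an odd number of paraphrases only rules out ties in a two-label task.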

Abstract

Argument mining (AM) is an interdisciplinary research field focused on the automatic identification and classification of argumentative components, such as claims and premises, and the relationships between them. Recent advances in large language models (LLMs) have significantly improved the performance of argument classification compared to traditional machine learning approaches. This study presents a comprehensive evaluation of several state-of-the-art LLMs, including GPT-5.2, Llama 4, and DeepSeek, on large, publicly available argument classification corpora such as Args.me and UKP. The evaluation incorporates advanced prompting strategies, including Chain-of-Thought prompting, prompt rephrasing, voting, and certainty-based classification. Both a quantitative evaluation using performance metrics and a qualitative error analysis are conducted to assess model behavior. The best-performing model in the…
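The abstract lists certainty-based classification without describing its mechanics. One plausible reading, offered only as an assumption: the model is asked to report a confidence score with each label, scores are summed per label across several replies, and the system abstains when the winning label's mean confidence falls below a threshold. The reply format `LABEL (confidence: 0.85)`, the 0.5 threshold, and the helper names `parse_reply` and `certainty_classify` are hypothetical.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Hypothetical reply format requested from the model, e.g.
# "Argument_for (confidence: 0.85)". Not taken from the paper.
REPLY_RE = re.compile(r"^\s*(\w+)\s*\(confidence:\s*([01](?:\.\d+)?)\)\s*$")


def parse_reply(reply: str) -> tuple[str, float]:
    """Split a model reply into (label, confidence in [0, 1])."""
    match = REPLY_RE.match(reply)
    if match is None:
        raise ValueError(f"unparseable reply: {reply!r}")
    label, conf = match.group(1), float(match.group(2))
    if conf > 1.0:
        raise ValueError(f"confidence out of range: {conf}")
    return label, conf


def certainty_classify(replies: list[str], threshold: float = 0.5) -> str | None:
    """Confidence-weighted vote over several replies.

    Returns the label with the largest summed confidence, or None (abstain)
    when that label's mean confidence is below `threshold`.
    """
    if not replies:
        raise ValueError("need at least one reply")
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for reply in replies:
        label, conf = parse_reply(reply)
        totals[label] += conf
        counts[label] += 1
    best = max(totals, key=totals.get)  # ties: first label encountered wins
    mean_conf = totals[best] / counts[best]
    return best if mean_conf >= threshold else None


print(certainty_classify([
    "Argument_for (confidence: 0.9)",
    "Argument_against (confidence: 0.6)",
    "Argument_for (confidence: 0.7)",
]))  # → Argument_for
print(certainty_classify(["NoArgument (confidence: 0.3)", "NoArgument (confidence: 0.4)"]))  # → None
```

Returning None instead of a forced label is one design choice among several; a system could instead fall back to plain majority voting for low-certainty inputs.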

Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Sentiment Analysis and Opinion Mining