Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments
Marc Feger, Katarina Boland, Stefan Dietze

TL;DR
This paper critically evaluates state-of-the-art argument mining models, revealing they often rely on dataset-specific cues and struggle to generalize across different datasets, despite strong benchmark performance.
Contribution
The paper provides the first large-scale re-evaluation of transformer models for argument mining, highlighting their reliance on dataset-specific cues and proposing methods to improve generalization.
Findings
Models rely on lexical shortcuts tied to content words.
Performance drops significantly on unseen datasets.
Task-specific pre-training improves robustness and generalization.
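The gap between in-dataset and cross-dataset performance described in these findings can be illustrated with a minimal sketch. This is not the paper's setup: a toy token-count classifier stands in for the transformers, and the two datasets (`A`, `B`), their sentences, and all function names are invented for illustration only.

```python
from collections import Counter

def train_keyword_model(examples):
    """Toy stand-in for a classifier: count tokens per label (1 = argument)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(model, text):
    """Predict 1 if tokens were seen more often in arguments than non-arguments."""
    score = sum(model[1][tok] - model[0][tok] for tok in text.lower().split())
    return 1 if score > 0 else 0

def accuracy(model, examples):
    return sum(predict(model, t) == y for t, y in examples) / len(examples)

def cross_dataset_matrix(datasets):
    """Train on each dataset's train split, evaluate on every test split."""
    results = {}
    for train_name, train_data in datasets.items():
        model = train_keyword_model(train_data["train"])
        for test_name, test_data in datasets.items():
            results[(train_name, test_name)] = accuracy(model, test_data["test"])
    return results

# Hypothetical data: each dataset covers one topic, so topic words
# ("nuclear", "school") correlate with the argument label inside it.
datasets = {
    "A": {
        "train": [
            ("nuclear power is safe because accidents are rare", 1),
            ("nuclear energy cuts emissions", 1),
            ("the meeting starts at noon", 0),
            ("the report has ten pages", 0),
        ],
        "test": [
            ("nuclear plants reduce emissions", 1),
            ("the meeting has ten pages", 0),
        ],
    },
    "B": {
        "train": [
            ("school uniforms limit expression", 1),
            ("uniforms reduce bullying in school", 1),
            ("nuclear is a word in the glossary", 0),
            ("the bus arrives at noon", 0),
        ],
        "test": [
            ("school uniforms reduce costs", 1),
            ("nuclear is in the glossary", 0),
        ],
    },
}

if __name__ == "__main__":
    for (train_name, test_name), acc in cross_dataset_matrix(datasets).items():
        print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

On this toy data the diagonal entries (same dataset for training and testing) are perfect while the off-diagonal entries fall sharply, because the model has learned topic words rather than anything about argument structure. A real evaluation would substitute fine-tuned models and actual benchmark splits.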
Abstract
Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on 17…
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Multi-Agent Systems and Negotiation
