Semi-Supervised Cleansing of Web Argument Corpora

Jonas Dorsch; Henning Wachsmuth

arXiv:2011.01798 · cs.CL · November 4, 2020 · 1 citation

TL;DR

This paper introduces a semi-supervised, precision-oriented method for detecting irrelevant or even detrimental text in web argument corpora, so that it can be removed to improve data quality for computational argumentation research.

Contribution

It presents a novel, precision-oriented semi-supervised approach that learns lexical patterns to detect irrelevant text, enhancing corpus cleansing with minimal manual effort.

Findings

01 Detected almost 87,000 irrelevant sentences in the args.me corpus (400k argumentative texts) at a precision of 0.97 according to manual evaluation

02 Adaptable to other web argument corpora with low effort

03 Improves corpus quality for argumentation research

Abstract

Debate portals and similar web platforms constitute one of the main text sources in computational argumentation research and its applications. While the corpora built upon these sources are rich in argumentatively relevant content and structure, they also include text that is irrelevant, or even detrimental, to their purpose. In this paper, we present a precision-oriented approach to detecting such irrelevant text in a semi-supervised way. Given a few seed examples, the approach automatically learns basic lexical patterns of relevance and irrelevance and then incrementally bootstraps new patterns from sentences matching the patterns. In the existing args.me corpus with 400k argumentative texts, our approach detects almost 87k irrelevant sentences, at a precision of 0.97 according to manual evaluation. With low effort, the approach can be adapted to other web argument corpora, providing…
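
The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word-bigram pattern type, the thresholds min_support and min_precision, the function names, and the toy sentences are all assumptions made for illustration. The sketch also simplifies the approach by learning only patterns of irrelevance, using the relevant seeds as counter-evidence that keeps ambiguous patterns out.

```python
from collections import Counter


def ngrams(sentence, n=2):
    """Return the set of lowercase word n-grams in a sentence."""
    tokens = sentence.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def learn_patterns(irrelevant, relevant, min_support=2, min_precision=0.9):
    """Keep n-grams that occur almost only in sentences labeled irrelevant.

    min_support and min_precision are hypothetical thresholds.
    """
    irr_counts = Counter(g for s in irrelevant for g in ngrams(s))
    rel_counts = Counter(g for s in relevant for g in ngrams(s))
    patterns = set()
    for gram, irr in irr_counts.items():
        precision = irr / (irr + rel_counts[gram])
        if irr >= min_support and precision >= min_precision:
            patterns.add(gram)
    return patterns


def bootstrap(seed_irrelevant, seed_relevant, unlabeled, max_rounds=5):
    """Alternate between learning patterns and labeling matching sentences."""
    irrelevant = set(seed_irrelevant)
    relevant = set(seed_relevant)
    pool = set(unlabeled) - irrelevant - relevant
    patterns = set()
    for _ in range(max_rounds):
        patterns = learn_patterns(irrelevant, relevant)
        matched = {s for s in pool if ngrams(s) & patterns}
        if not matched:
            break  # no new sentences matched, so no new patterns can emerge
        irrelevant |= matched
        pool -= matched
    return irrelevant - set(seed_irrelevant), patterns


# Toy data (invented for illustration, not taken from args.me).
seed_irrelevant = [
    "thanks for the debate",
    "thanks for the round",
    "vote for pro",
    "please vote for con",
]
seed_relevant = [
    "the death penalty does not deter crime",
    "the evidence for the claim is weak",
]
unlabeled = [
    "thanks for reading and good luck",
    "i vote for pro and good luck",
    "good luck to my opponent",
    "school uniforms reduce bullying for the most part",
]

detected, patterns = bootstrap(seed_irrelevant, seed_relevant, unlabeled)
for sentence in sorted(detected):
    print(sentence)
```

In this toy run, the bigram "for the" appears in two irrelevant seeds but also in a relevant one, so it falls below the precision threshold and the school-uniform sentence is left alone. The pattern "good luck" is not in any seed; it is picked up only after the first round labels two sentences containing it, which is what lets the third unlabeled sentence be detected in the second round.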

Code & Models

Repositories

webis-de/ArgMining-20 (official)

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques