Corpus Statistics in Text Classification of Online Data

Marina Sokolova; Victoria Bobicev

arXiv:1803.06390 · cs.CL · March 20, 2018 · 1 citation

Open Access

TL;DR

This paper investigates how corpus characteristics of textual data sets correspond to text classification results, through an empirical study of multi-class sentiment analysis on two data sets gathered from sub-forums of an online health-related forum.

Contribution

It provides an empirical analysis of how corpus characteristics correspond to text classification results on online health forum data.

Findings

01 Corpus features correlate with classification performance

02 Insights into data-driven improvements for sentiment analysis

03 Empirical evidence on corpus influence in real-world datasets

Abstract

The transformation of Machine Learning (ML) from a boutique science to a generally accepted technology has increased the importance of reproduction and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our empirical results are obtained for a multi-class sentiment analysis application.

Taxonomy

Topics: Advanced Text Analysis Techniques · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection