Corpus Statistics in Text Classification of Online Data
Marina Sokolova, Victoria Bobicev

TL;DR
This paper investigates how corpus characteristics of textual data sets correspond to text classification results, using an empirical multi-class sentiment analysis study on two data sets gathered from sub-forums of an online health-related forum.
Contribution
It provides an empirical analysis of corpus characteristics' effects on text classification accuracy in online health forum data.
Findings
Corpus features correlate with classification performance
Insights into data-driven improvements for sentiment analysis
Empirical evidence on corpus influence in real-world datasets
Abstract
The transformation of Machine Learning (ML) from a boutique science to a generally accepted technology has increased the importance of reproducibility and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our empirical results are obtained for a multi-class sentiment analysis application.
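
The abstract does not list which corpus characteristics were measured. As an illustration only, the sketch below computes a few commonly used descriptive statistics (token count, vocabulary size, type-token ratio, hapax legomena, mean document length) for a small collection of posts. The function name `corpus_statistics`, the crude regex tokenizer, and the toy posts are all assumptions made for this example and are not taken from the paper.

```python
import re
from collections import Counter


def corpus_statistics(documents):
    """Return simple descriptive statistics for a list of text documents."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    counts = Counter(tok for doc in tokenized for tok in doc)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    return {
        "documents": len(documents),
        "tokens": n_tokens,
        "types": n_types,
        # Vocabulary size relative to corpus size.
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        # Words that occur exactly once in the corpus.
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "mean_tokens_per_document": n_tokens / len(documents) if documents else 0.0,
    }


# Hypothetical toy posts, not data from the paper.
posts = [
    "Thank you all for the support",
    "The support here is great",
    "Feeling worried about the results",
]
stats = corpus_statistics(posts)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Statistics like these could be computed separately for each data set and then set side by side with classification scores, which is one plausible way to relate corpus characteristics to results.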
Taxonomy
Topics: Advanced Text Analysis Techniques · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection
