Methods for Generating Drift in Text Streams
Cristiano Mesquita Garcia, Alessandro Lameiras Koerich, Alceu de Souza Britto Jr., Jean Paul Barddal

TL;DR
This paper introduces four methods to generate labeled concept drifts in textual data streams, facilitating the creation of benchmark datasets for evaluating drift detection and adaptation in machine learning models.
Contribution
It proposes novel textual drift generation techniques and evaluates their effectiveness on real-world datasets using incremental classifiers.
Findings
All four methods cause classifier performance to degrade after the drifts.
Among the incremental classifiers evaluated, the incremental SVM recovers fastest in accuracy and Macro F1-Score.
The methods support benchmarking of drift detection and adaptation in text streams.
Abstract
Systems and individuals produce data continuously. On the Internet, people share their knowledge, sentiments, and opinions, provide reviews about services and products, and so on. Automatically learning from these textual data can provide insights to organizations and institutions, thus preventing financial impacts, for example. To learn from textual data over time, the machine learning system must account for concept drift. Concept drift is a frequent phenomenon in real-world datasets and corresponds to changes in data distribution over time. For instance, a concept drift occurs when sentiments change or a word's meaning is adjusted over time. Although concept drift is frequent in real-world applications, benchmark datasets with labeled drifts are rare in the literature. To bridge this gap, this paper provides four textual drift generation methods to ease the production of datasets…
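To make the idea of a labeled drift concrete, here is a minimal sketch in plain Python. It assumes a label swap (sentiment labels flipped from a chosen position onward) as the drift, purely for illustration; the truncated abstract does not name the paper's four methods, so this should not be read as one of them. The toy vocabulary, the helper names (make_stream, inject_label_swap, prequential_accuracy), and the small online perceptron are all hypothetical stand-ins, not the paper's datasets or its incremental classifiers.

```python
from collections import defaultdict

# Toy vocabularies (assumption: not taken from the paper's datasets).
POSITIVE = ["great", "excellent", "love", "wonderful"]
NEGATIVE = ["terrible", "awful", "hate", "poor"]


def make_stream(n):
    """Build a deterministic stream of (text, label); 1 = positive, 0 = negative."""
    stream = []
    for i in range(n):
        label = i % 2
        vocab = POSITIVE if label == 1 else NEGATIVE
        j = i // 2
        text = f"{vocab[j % 4]} {vocab[(j + 1) % 4]}"
        stream.append((text, label))
    return stream


def inject_label_swap(stream, drift_at):
    """Flip every label from index drift_at onward and record the drift position."""
    drifted = [
        (text, label if i < drift_at else 1 - label)
        for i, (text, label) in enumerate(stream)
    ]
    return drifted, [drift_at]


class OnlinePerceptron:
    """Bag-of-words perceptron updated one example at a time (illustrative only)."""

    def __init__(self):
        self.w = defaultdict(float)

    def predict(self, text):
        score = sum(self.w[tok] for tok in text.split())
        return 1 if score > 0 else 0

    def learn(self, text, label):
        error = label - self.predict(text)
        if error != 0:
            for tok in text.split():
                self.w[tok] += error


def prequential_accuracy(stream, window):
    """Test-then-train evaluation; returns accuracy per non-overlapping window."""
    model = OnlinePerceptron()
    accs, correct = [], 0
    for count, (text, label) in enumerate(stream, start=1):
        correct += int(model.predict(text) == label)  # test first
        model.learn(text, label)                      # then train
        if count % window == 0:
            accs.append(correct / window)
            correct = 0
    return accs


stream, drift_points = inject_label_swap(make_stream(400), drift_at=200)
accs = prequential_accuracy(stream, window=10)
print("labeled drift at:", drift_points)
print("accuracy around the drift:", accs[19:22])
```

Because the drift position is returned alongside the stream, a drift detector's alarms can be compared against a known ground truth, and the windowed accuracy shows the drop and recovery pattern that the findings describe.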
Taxonomy
Topics: Advanced Database Systems and Queries
Methods: Support Vector Machine
