A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting
Muhammad Imran, Sanjay Chawla, Carlos Castillo

TL;DR
This paper introduces a robust framework built on an Expert-Machine-Crowd (EMC) triad for classifying social media data streams, detecting novel categories and human annotation errors in real time.
Contribution
It presents a novel optimization-based approach, COD-Means, that unifies constrained clustering and outlier detection for evolving data streams in a mixed human-machine setting.
Findings
Effective detection of novel categories in social media streams
Improved accuracy by identifying human annotation errors
Efficient algorithm suitable for real-time data processing
Abstract
An emerging challenge in the online classification of social media data streams is to keep the categories used for classification up-to-date. In this paper, we propose an innovative framework based on an Expert-Machine-Crowd (EMC) triad to help categorize items by continuously identifying novel concepts in heterogeneous data streams often riddled with outliers. We unify constrained clustering and outlier detection by formulating a novel optimization problem: COD-Means. We design an algorithm to solve the COD-Means problem and show that COD-Means will not only help detect novel categories but also seamlessly discover human annotation errors and improve the overall quality of the categorization process. Experiments on diverse real data sets demonstrate that our approach is both effective and efficient.
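The idea of handling clustering and outlier detection in one loop can be illustrated with a small sketch. This is not the authors' COD-Means algorithm: it omits the constraints from expert and crowd labels entirely and shows only a simplified k-means variant that, in each round, sets aside the points farthest from their nearest center before updating the centers. The function name `kmeans_with_outliers` and the parameters `n_outliers` and `init` are hypothetical and chosen for illustration.

```python
import numpy as np

def kmeans_with_outliers(X, k, n_outliers, init=None, n_iter=50, seed=0):
    """Illustrative only: Lloyd-style clustering that excludes the
    n_outliers farthest points from each center update.
    Not the COD-Means algorithm from the paper (no constraints)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if init is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        centers = np.asarray(init, dtype=float).copy()

    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2.min(axis=1)

        # The n_outliers points farthest from their nearest center
        # are treated as outliers for this round.
        outliers = np.argsort(nearest)[n - n_outliers:]
        inlier = np.ones(n, dtype=bool)
        inlier[outliers] = False

        # Recompute each center from its inlier members only.
        new_centers = centers.copy()
        for j in range(k):
            members = X[inlier & (labels == j)]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)

        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return centers, labels, np.sort(outliers)
```

In a stream setting like the one the abstract describes, the points returned as outliers would be the candidates to inspect further, either as possible members of a novel category or as possible annotation errors. How the paper actually makes that distinction, and how it incorporates human labels as constraints, is not shown here.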
