A Framework for Fast Polarity Labelling of Massive Data Streams
Huilin Wu, Mian Lu, Zhao Zheng, Shuhao Zhang

TL;DR
PLStream is an Apache Flink-based framework that labels the sentiment polarity of massive, high-speed data streams such as tweets and product reviews, reaching almost 80% accuracy without manual annotation.
Contribution
The paper introduces PLStream, a system that combines algorithmic improvements with system optimizations for fast, high-quality polarity labelling of unlabelled data streams.
Findings
Achieves almost 80% accuracy in polarity labelling on two real-world workloads.
Sustains input rates of almost 16,000 tuples/sec.
Requires no manual annotation effort.
Abstract
Many existing sentiment analysis techniques are based on supervised learning and require valuable training datasets to train their models. When dataset freshness matters, annotating high-speed unlabelled data streams becomes critical, yet it remains an open problem. In this paper, we propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, such as Twitter tweets or online product reviews. We address the associated implementation challenges and propose a set of techniques that includes both algorithmic improvements and system optimizations. A thorough empirical validation with two real-world workloads demonstrates that PLStream generates high-quality labels (almost 80% accuracy) on high-speed continuous unlabelled data streams (almost 16,000 tuples/sec) without any manual effort.
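To make the two reported metrics concrete, the following is a minimal sketch of what labelling an unlabelled text stream and measuring accuracy and throughput could look like. It is not PLStream's method, which the abstract does not detail: the seed-word lexicon, the function names (`label_polarity`, `label_stream`, `evaluate`), and the tie-breaking rule are all assumptions made for illustration, and the sketch runs in a single Python process rather than on Apache Flink.

```python
import time
from typing import Iterable, Iterator, List, Tuple

# Hypothetical seed lexicons, chosen for illustration only (not from the paper).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "awful", "hate", "terrible", "sad"}


def label_polarity(text: str) -> int:
    """Return 1 (positive) or 0 (negative) from seed-word counts.

    Tokens are split on whitespace only, so punctuation attached to a word
    prevents a match. Ties, including texts with no seed words, default to 1.
    """
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1 if score >= 0 else 0


def label_stream(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Lazily attach a polarity label to each tuple as it arrives."""
    for text in stream:
        yield text, label_polarity(text)


def evaluate(stream: Iterable[str], gold: List[int]) -> Tuple[float, float]:
    """Return (accuracy, tuples per second) against held-out gold labels."""
    start = time.perf_counter()
    labelled = list(label_stream(stream))
    elapsed = time.perf_counter() - start
    correct = sum(pred == g for (_, pred), g in zip(labelled, gold))
    accuracy = correct / len(gold)
    throughput = len(labelled) / elapsed if elapsed > 0 else float("inf")
    return accuracy, throughput


if __name__ == "__main__":
    texts = ["good day", "bad day", "sad and awful", "no opinion words"]
    acc, tps = evaluate(texts, gold=[1, 0, 0, 0])
    print(f"accuracy={acc:.2f} throughput={tps:.0f} tuples/sec")
```

Gold labels appear only in `evaluate`, to score the output; the labelling step itself uses no annotated data, which mirrors the "without any manual effort" setting described in the abstract.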
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Spam and Phishing Detection · Web Data Mining and Analysis
