Parallel clustering of high-dimensional social media data streams
Xiaoming Gao, Emilio Ferrara, Judy Qiu

TL;DR
This paper presents a parallel clustering system for high-dimensional social media streams, enabling real-time processing of Twitter data by addressing system synchronization and scalability challenges.
Contribution
It introduces a parallel clustering approach within Cloud DIKW that efficiently handles high-dimensional data and dynamic cluster updates for real-time social media stream analysis.
Findings
Processed the 10% Twitter data stream in real time with 96-way parallelism.
Proposed a synchronization method using pub-sub messaging for distributed workers.
Scalability projected to reach full Twitter stream processing with 1000-way parallelism.
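The pub-sub synchronization idea in the findings above can be illustrated with a minimal sketch. The summary states only that pub-sub messaging coordinates the distributed workers; everything else here is an assumption made for illustration, including the in-process `Broker` class, the `Worker` class, and the choice of "cluster size deltas" as the message payload. It is not the paper's implementation.

```python
from collections import deque


class Broker:
    """Hypothetical in-process stand-in for a pub-sub channel."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, inbox):
        self.subscribers.append(inbox)

    def publish(self, message):
        # Every subscriber receives every message, in publish order.
        for inbox in self.subscribers:
            inbox.append(message)


class Worker:
    """Hypothetical worker holding a local replica of the cluster state."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.inbox = deque()
        self.cluster_sizes = {}  # local replica: cluster id -> member count
        broker.subscribe(self.inbox)

    def process(self, cluster_id):
        # Publish the change as a delta instead of mutating local state,
        # so all replicas (including this one) apply the same updates.
        self.broker.publish((cluster_id, 1))

    def sync(self):
        # Drain the inbox and apply all pending deltas in arrival order.
        while self.inbox:
            cluster_id, delta = self.inbox.popleft()
            self.cluster_sizes[cluster_id] = (
                self.cluster_sizes.get(cluster_id, 0) + delta
            )


def demo():
    broker = Broker()
    workers = [Worker(f"w{i}", broker) for i in range(3)]
    workers[0].process("c1")
    workers[1].process("c1")
    workers[2].process("c2")
    for w in workers:
        w.sync()
    return [w.cluster_sizes for w in workers]
```

Because each worker applies the same deltas in the same order, all replicas converge to the same state after `sync()`, without any direct worker-to-worker connection.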
Abstract
We introduce Cloud DIKW as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. Recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. Due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts to meet the constraints of real-time social stream clustering through parallelization. We focus on two system-level issues. Most stream processing engines like Apache Storm organize distributed workers in the form of a directed acyclic graph, making it difficult to dynamically…
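To make the cost argument in the abstract concrete, here is a minimal sketch of a sequential single-pass clustering loop over sparse high-dimensional vectors. It is a generic illustration, not the authors' algorithm: the dict-based vector representation, the cosine similarity measure, the `threshold` parameter, and the function names are all assumptions. Each incoming point is compared against every existing cluster, which is why similarity computation dominates the running time.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stream, threshold=0.5):
    """Assign each vector to its most similar cluster, or open a new one.

    `threshold` is a hypothetical similarity cutoff chosen for illustration.
    """
    centroids = []    # per-cluster sum of member vectors (same direction as the mean)
    assignments = []  # cluster index chosen for each input vector
    for vec in stream:
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Fold the point into the winning cluster.
            for k, w in vec.items():
                centroids[best][k] = centroids[best].get(k, 0.0) + w
        else:
            # No sufficiently similar cluster: start a new one.
            centroids.append(dict(vec))
            best = len(centroids) - 1
        assignments.append(best)
    return assignments, centroids
```

The inner loop performs one similarity computation per existing cluster for every incoming point, so the per-point cost grows with both the number of clusters and the vector size. This is the work that a parallel design spreads across workers.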
