Parallel clustering of high-dimensional social media data streams

Xiaoming Gao; Emilio Ferrara; Judy Qiu

arXiv:1502.00316 · cs.DC · March 7, 2017

TL;DR

This paper presents a parallel clustering system for high-dimensional social media streams, enabling real-time processing of Twitter data by addressing system synchronization and scalability challenges.

Contribution

It introduces a parallel clustering approach within Cloud DIKW that efficiently handles high-dimensional data and dynamic cluster updates for real-time social media stream analysis.

Findings

1. Processed the 10% Twitter data stream in real time with 96-way parallelism.

2. Proposed a synchronization method for distributed workers based on pub-sub messaging.

3. Projected that 1000-way parallelism would scale the system to the full Twitter stream.
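
As a rough sanity check on the 1000-way projection, the arithmetic below assumes that throughput grows linearly with the number of parallel workers. That assumption is ours, not a claim from the summary, and the function name `workers_needed` is hypothetical.

```python
def workers_needed(current_workers: int, current_percent: int,
                   target_percent: int = 100) -> int:
    """Workers needed to handle target_percent of the stream.

    ASSUMPTION: throughput scales linearly with worker count, so the
    worker count grows in proportion to the share of the stream handled.
    Integer ceiling division avoids floating-point rounding.
    """
    return -(-current_workers * target_percent // current_percent)


# 96 workers keep up with the 10% stream; scale to the full (100%) stream.
print(workers_needed(96, 10))  # prints 960
```

Under that linear assumption the full stream needs about 960 workers, which is the same order as the 1000-way figure. Any synchronization overhead that grows with worker count would push the real number higher.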

Abstract

We introduce Cloud DIKW as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. Recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. Due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts to meet the constraints of real-time social stream clustering through parallelization. We focus on two system-level issues. Most stream processing engines like Apache Storm organize distributed workers in the form of a directed acyclic graph, making it difficult to dynamically…
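
To illustrate why similarity computation dominates the cost of a sequential single-pass algorithm, here is a minimal sketch of a generic threshold-based single-pass clusterer over sparse vectors. It is not the authors' algorithm: the cosine measure, the `threshold` parameter, the summed-vector centroids, and the feature names in the usage example are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stream, threshold=0.5):
    """Assign each point to its most similar cluster, or open a new one.

    Hypothetical sketch: every incoming point is compared with every
    centroid, so the per-point cost grows with the number of clusters
    and with the size of the vectors.
    """
    centroids = []  # each centroid is the sum of its members' vectors
    labels = []
    for vec in stream:
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            for f, w in vec.items():
                centroids[best][f] = centroids[best].get(f, 0.0) + w
            labels.append(best)
        else:
            centroids.append(dict(vec))
            labels.append(len(centroids) - 1)
    return labels, centroids


# Hypothetical features mixing textual content and social network information.
stream = [
    {"word:storm": 1.0, "user:alice": 1.0},
    {"word:storm": 1.0, "user:alice": 1.0},
    {"word:soccer": 1.0},
]
labels, centroids = single_pass_cluster(stream)
print(labels)  # prints [0, 0, 1]
```

The inner loop over `centroids` is the part a parallel design would presumably distribute across workers, which in turn is where the need to keep cluster state synchronized comes from.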
