Large-Scale Learning from Data Streams with Apache SAMOA
Nicolas Kourtellis, Gianmarco De Francisci Morales, and Albert Bifet

TL;DR
Apache SAMOA is an open-source platform for scalable, distributed mining of big data streams. It provides a collection of distributed streaming algorithms and runs on multiple distributed stream processing engines.
Contribution
The paper presents a pluggable architecture for distributed streaming algorithms, together with programming abstractions for developing new ones, so that the same algorithm can run on different stream processing engines.
Findings
Supports classification, clustering, and regression tasks
Runs on Apache Flink, Apache Storm, and Apache Samza
Open source (Apache Software License version 2.0), written in Java, and extensible
Abstract
Apache SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software License version 2.0.
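The pluggable architecture described in the abstract can be illustrated with a minimal sketch. This is not SAMOA's actual API: the names `StreamProcessor`, `ExecutionEngine`, `LocalEngine`, and `RunningMean` are hypothetical, chosen only to show the idea that an algorithm is written once against an engine-agnostic interface, while each engine supplies its own adapter. Here the only adapter is a single-threaded in-memory one, and the "algorithm" is a simple incremental mean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PluggableSketch {

    /** Hypothetical engine-agnostic unit of work: consumes one event and may emit results. */
    interface StreamProcessor<I, O> {
        void process(I event, Consumer<O> emit);
    }

    /** Hypothetical adapter that each execution engine would implement. */
    interface ExecutionEngine {
        <I, O> List<O> run(Iterable<I> source, StreamProcessor<I, O> processor);
    }

    /** Assumed stand-in for a real engine: single-threaded and in-memory. */
    static final class LocalEngine implements ExecutionEngine {
        @Override
        public <I, O> List<O> run(Iterable<I> source, StreamProcessor<I, O> processor) {
            List<O> out = new ArrayList<>();
            for (I event : source) {
                // Feed events one at a time; the processor never sees the whole stream.
                processor.process(event, out::add);
            }
            return out;
        }
    }

    /** A streaming algorithm with constant memory: incremental running mean. */
    static final class RunningMean implements StreamProcessor<Double, Double> {
        private long count = 0;
        private double mean = 0.0;

        @Override
        public void process(Double x, Consumer<Double> emit) {
            count++;
            mean += (x - mean) / count; // incremental update, no stored history
            emit.accept(mean);
        }
    }

    /** The algorithm does not change when a different engine is passed in. */
    static List<Double> demo(ExecutionEngine engine) {
        return engine.run(List.of(2.0, 4.0, 6.0), new RunningMean());
    }

    public static void main(String[] args) {
        System.out.println(demo(new LocalEngine())); // prints [2.0, 3.0, 4.0]
    }
}
```

In this sketch, supporting another engine would mean adding another `ExecutionEngine` implementation while `RunningMean` stays untouched. A real distributed engine would also need to handle partitioning and state placement, which this example leaves out.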
Taxonomy
Topics: Data Stream Mining Techniques · Machine Learning and Data Classification · Air Quality Monitoring and Forecasting
