Sampling Streaming Data with Parallel Vector Quantization -- PVQ
Mujahid Sultan

TL;DR
This paper introduces a vector quantization-based sampling method that reduces class imbalance in streaming data, improving the accuracy of various machine learning models in cloud-centric network traffic analysis.
Contribution
The paper presents a novel parallel vector quantization sampling technique specifically designed to address class imbalance in streaming data for cloud-based applications.
Findings
Improved classification accuracy with the proposed sampling method.
Effective reduction of class imbalance in network traffic datasets.
Compatibility with multiple ML models and hyperparameter optimization.
Abstract
Accumulation of corporate data in the cloud has attracted more enterprise applications to the cloud, creating data gravity. As a consequence, network traffic has become more cloud-centric. This increase in cloud-centric traffic poses new challenges in designing learning systems for streaming data due to class imbalance. The number of classes plays a vital role in the accuracy of the classifiers built from the data streams. In this paper, we present a vector quantization-based sampling method, which substantially reduces the class imbalance in data streams. We demonstrate its effectiveness by conducting experiments on network traffic and anomaly datasets with commonly used ML model building methods: Multilayered Perceptron on a TensorFlow backend, Support Vector Machines, K-Nearest Neighbour, and Random Forests. We built models using parallel processing, batch processing, and randomly…
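The abstract does not spell out the sampling algorithm, so the following is only a minimal sketch of the general idea of vector-quantization-based sampling, not the paper's method. It assumes that each class is quantized separately with plain Lloyd (k-means) iterations and that the resulting codewords stand in for the original points, so every class ends up with at most the same number of samples. The names `vq_codebook` and `vq_sample`, the parameter `k`, and the synthetic two-class data are hypothetical, and the sketch is sequential rather than parallel.

```python
import random

def _nearest(point, codebook):
    """Index of the codeword closest to point (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, codebook[j])),
    )

def vq_codebook(points, k, iters=20, seed=0):
    """Plain Lloyd (k-means) quantization; returns at most k codewords."""
    rng = random.Random(seed)
    # Initialise the codebook from randomly chosen data points.
    codebook = [list(p) for p in rng.sample(points, min(k, len(points)))]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for p in points:
            cells[_nearest(p, codebook)].append(p)
        # Move each codeword to the mean of its cell; keep it if the cell is empty.
        codebook = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else cw
            for cell, cw in zip(cells, codebook)
        ]
    return codebook

def vq_sample(X, y, k):
    """Assumed scheme: replace each class with at most k codewords."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    Xs, ys = [], []
    for label, pts in by_class.items():
        for cw in vq_codebook(pts, k):
            Xs.append(cw)
            ys.append(label)
    return Xs, ys

# Synthetic, made-up data: 500 "normal" points and 20 "anomaly" points.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(20)]
y = ["normal"] * 500 + ["anomaly"] * 20

Xs, ys = vq_sample(X, y, k=10)
```

With `k=10`, the 500-to-20 imbalance in the toy data becomes 10 codewords per class. Because each class is quantized independently in this sketch, the per-class calls could in principle be dispatched to separate workers, though whether that matches the parallelism in the paper is not stated in the abstract.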
Taxonomy
Topics: Data Stream Mining Techniques · Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection
