Generating Multidimensional Clusters With Support Lines
Nuno Fachada, Diogo de Andrade

TL;DR
This paper introduces Clugen, a versatile open-source tool for generating synthetic multidimensional clusters supported by lines, aiding in the evaluation and development of clustering algorithms across various research contexts.
Contribution
The paper presents Clugen, a modular, open-source synthetic data generator that creates multidimensional clusters supported by line segments and is available for the Python, R, Julia, and MATLAB/Octave ecosystems.
Findings
Produces diverse multidimensional clusters
Suitable for clustering algorithm assessment
Widely applicable in clustering research
Abstract
Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for more complete coverage of a given problem's space. In turn, synthetic data generators can create vast amounts of data, which is crucial when real-world data is at a premium, while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results in various dimensions, is fit for use in the…
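To make the idea of "clusters supported by line segments" concrete, the following is a minimal illustrative sketch in plain Python. It is not Clugen's algorithm or API: the function names (random_unit_vector, line_supported_cluster, generate), the uniform placement along the segment, the isotropic Gaussian offset, and all numeric defaults are assumptions chosen only for illustration. Each cluster is built by picking a center and a direction, sampling positions along a segment through that center, and perturbing each position with noise.

```python
import math
import random


def random_unit_vector(num_dims, rng):
    """Draw a random direction by normalizing a vector of Gaussian samples."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(num_dims)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def line_supported_cluster(center, direction, length, noise_std, num_points, rng):
    """Scatter points around the segment center +/- (length / 2) * direction."""
    points = []
    for _ in range(num_points):
        # Position along the supporting segment (uniform here, by assumption).
        t = rng.uniform(-length / 2.0, length / 2.0)
        # Point on the segment plus an isotropic Gaussian offset per coordinate.
        points.append([c + t * d + rng.gauss(0.0, noise_std)
                       for c, d in zip(center, direction)])
    return points


def generate(num_dims, num_clusters, points_per_cluster, seed=0):
    """Return (data, labels) for several line-supported clusters."""
    rng = random.Random(seed)  # seeded for reproducible output
    data, labels = [], []
    for k in range(num_clusters):
        center = [rng.uniform(-10.0, 10.0) for _ in range(num_dims)]
        direction = random_unit_vector(num_dims, rng)
        data.extend(line_supported_cluster(center, direction, 5.0, 0.3,
                                           points_per_cluster, rng))
        labels.extend([k] * points_per_cluster)
    return data, labels


data, labels = generate(num_dims=3, num_clusters=4, points_per_cluster=50)
print(len(data), len(data[0]), sorted(set(labels)))
```

The returned labels give the ground-truth cluster membership of each point, which is what makes generated data of this kind usable for scoring a clustering algorithm's output. The abstract's "arbitrary distributions" would correspond to swapping the uniform and Gaussian draws in this sketch for other samplers.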
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Visualization and Analytics · Data Stream Mining Techniques
