A Hash-based Co-Clustering Algorithm for Categorical Data
Fabricio Olivetti de França

TL;DR
This paper introduces a novel hash-based co-clustering algorithm for categorical data that efficiently finds meaningful clusters by leveraging Locality Sensitive Hashing, addressing challenges of feature importance and multiple cluster interpretations.
Contribution
The paper presents a new co-clustering method using Locality Sensitive Hashing to improve clustering quality and scalability for categorical data.
Findings
Capable of finding high-quality co-clusters across various datasets
Scales linearly with dataset size
Effective in handling feature importance and multiple cluster interpretations
Abstract
Many real-life data are described by categorical attributes without a pre-classification. A common data mining method used to extract information from this type of data is clustering. This method groups together the samples that are more similar to each other than to the remaining samples. However, categorical data pose a challenge when extracting information because: the similarity of two objects is usually calculated by counting the features they have in common, which ignores any importance weighting of those features; and if the data can be divided differently according to different subsets of the features, the algorithm may find clusters with different meanings from each other, complicating the post-analysis. Co-clustering of categorical data is a technique that tries to find subsets of samples that share a subset of features in common. By doing so, not only a sample may belong to more than one…
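The idea of grouping samples that share a subset of features can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes MinHash as the Locality Sensitive Hashing family (the text above only says LSH), encodes each categorical value as a hypothetical "attribute=value" string, and assumes every object has at least one feature. The names `minhash_signature` and `co_cluster_candidates` are invented for this illustration.

```python
import hashlib
from collections import defaultdict


def _h(seed, feature):
    # Deterministic hash of a (seed, feature) pair; md5 is used only for
    # reproducibility across runs, not for security.
    digest = hashlib.md5(f"{seed}:{feature}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(features, n_hashes=2):
    # One min-hash per seed: the smallest hash value over the feature set.
    # Assumes `features` is non-empty.
    return tuple(min(_h(seed, f) for f in features) for seed in range(n_hashes))


def co_cluster_candidates(data, n_hashes=2, min_rows=2):
    # Bucket objects by their MinHash signature (illustrative LSH step).
    buckets = defaultdict(list)
    for name, features in data.items():
        buckets[minhash_signature(features, n_hashes)].append(name)

    # Each bucket with enough objects becomes a candidate co-cluster:
    # the objects in it, plus the features that all of them share.
    clusters = []
    for rows in buckets.values():
        if len(rows) >= min_rows:
            shared = set.intersection(*(set(data[r]) for r in rows))
            clusters.append((sorted(rows), sorted(shared)))
    return clusters


# Toy categorical data set (hypothetical values).
data = {
    "a": {"color=red", "shape=round", "size=small"},
    "b": {"color=red", "shape=round", "size=small"},
    "c": {"color=blue", "shape=square", "size=large"},
}
clusters = co_cluster_candidates(data)
print(clusters)
```

Objects with identical feature sets always fall into the same bucket, and objects with no feature in common do not, barring hash collisions. For partially overlapping objects, the probability of sharing a bucket is their Jaccard similarity raised to the power `n_hashes`, so a larger `n_hashes` makes the grouping stricter.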
Taxonomy
Topics: Text and Document Classification Technologies · Face and Expression Recognition
