$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets
Phuc Nguyen, Rohit Arora, Elliot D. Hill, Jasper Braun, Alexandra Morgan, Liza M. Quintana, Gabrielle Mazzoni, Ghee Rye Lee, Rima Arnaout, Ramy Arnaout

TL;DR
sentropy is a Python package that computes similarity-sensitive entropy measures, enabling richer dataset characterization beyond size and class balance, applicable across diverse fields like immunomics and medical imaging.
Contribution
The paper introduces sentropy, a Python tool that efficiently calculates S-entropy measures for large datasets, filling a gap in accessible tools for similarity-sensitive dataset analysis.
Findings
sentropy effectively computes S-entropy for large datasets
It demonstrates versatility across multiple scientific fields
It provides metrics for comparing datasets
Abstract
Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed S-entropy (similarity-sensitive entropy), that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed sentropy, a Python package that calculates S-entropy and is tailored to large datasets. sentropy can calculate any of the frequency-sensitive measures of Hill's D-number framework and their similarity-sensitive counterparts. sentropy also…
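To make the abstract's two families of measures concrete, here is a minimal NumPy sketch of a frequency-sensitive Hill number and its similarity-sensitive counterpart. It does not use or reproduce the sentropy API. The function name `similarity_sensitive_diversity` and its parameters are hypothetical, and the formula is the standard similarity-sensitive generalization of Hill numbers, D = (Σᵢ pᵢ (Zp)ᵢ^(q−1))^(1/(1−q)), assumed here for illustration with q ≥ 0 and a similarity matrix Z whose diagonal is 1.

```python
import numpy as np


def similarity_sensitive_diversity(counts, q, similarity=None):
    """Effective number of elements at viewpoint parameter q (hypothetical helper).

    counts     : 1-D array of non-negative element counts.
    q          : viewpoint parameter, assumed q >= 0 (may be np.inf).
    similarity : optional (n, n) matrix Z with entries in [0, 1] and ones on
                 the diagonal; None means the identity matrix, which gives the
                 ordinary frequency-sensitive Hill number.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()  # relative frequencies
    if similarity is None:
        Z = np.eye(p.size)
    else:
        Z = np.asarray(similarity, dtype=float)

    zp = Z @ p  # "ordinariness" of each element: frequency of it and its look-alikes

    # Elements with zero frequency contribute nothing; drop them to avoid 0 ** negative.
    present = p > 0
    p, zp = p[present], zp[present]

    if np.isclose(q, 1.0):
        # Limit q -> 1: exponential of the similarity-sensitive Shannon entropy.
        return float(np.exp(-np.sum(p * np.log(zp))))
    if np.isinf(q):
        # Limit q -> infinity: dominated by the most "ordinary" element.
        return float(1.0 / zp.max())
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


counts = [2, 1, 1]

# Without similarity, q = 0 simply counts the distinct elements present.
richness = similarity_sensitive_diversity(counts, q=0)

# If every element is treated as identical to every other, the dataset
# behaves as if it contained a single effective element.
all_alike = np.ones((3, 3))
collapsed = similarity_sensitive_diversity(counts, q=0, similarity=all_alike)
```

With the identity matrix the function reduces to the frequency-only case, so the same code path covers both families. For machine-learning-sized datasets a dense n × n similarity matrix quickly becomes the bottleneck, which is the kind of scaling concern the abstract says the package is designed to handle.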
Taxonomy
Topics: Gut microbiota and health · Single-cell and spatial transcriptomics · Gene expression and cancer classification
Methods: Sparse Evolutionary Training
