$\textit{lucie}$: An Improved Python Package for Loading Datasets from the UCI Machine Learning Repository
Kenneth Ge, Phuc Nguyen, Ramy Arnaout

TL;DR
lucie is a Python package that significantly improves the importability of datasets from the UCI ML Repository by handling nonstandard formats, achieving a success rate of 95.4%.
Contribution
The paper introduces an automated utility that raises the import success rate for UCI datasets, addressing limitations of existing tools.
Findings
95.4% success rate on top datasets
Outperforms ucimlrepo's 73.1% success rate
Available as a Python package with high code coverage
Abstract
The University of California, Irvine (UCI) Machine Learning (ML) Repository (UCIMLR) is consistently cited as one of the most popular dataset repositories, hosting hundreds of high-impact datasets. However, a significant portion, including 28.4% of the top 250, cannot be imported via the package that the UCIMLR website provides and recommends (ucimlrepo). Instead, they are hosted as .zip files containing nonstandard formats that are difficult to import without additional ad hoc processing. To address this issue, here we present lucie, a utility that automatically determines the data format and imports many of these previously non-importable datasets, while preserving as much of a tabular data structure as possible. lucie was…
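To make the idea of "automatically determining the data format" concrete, the following is a minimal sketch using only the Python standard library. It is not lucie's implementation or API: the function names `sniff_and_load` and `make_demo_zip`, the file name `demo.data`, and the sample rows are all hypothetical, chosen only to illustrate guessing a delimiter for a text file stored inside a .zip archive.

```python
import csv
import io
import zipfile


def sniff_and_load(zip_bytes, member):
    """Guess the delimiter of one text file inside a zip archive and parse it.

    Hypothetical helper for illustration; not part of lucie or ucimlrepo.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        text = archive.read(member).decode("utf-8", errors="replace")
    # csv.Sniffer inspects a sample of the text and guesses the dialect;
    # the candidate delimiters are restricted to a few common ones.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


def make_demo_zip():
    """Build a small in-memory archive with a semicolon-delimited file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(
            "demo.data",  # assumed file name, for illustration only
            "5.1;3.5;setosa\n4.9;3.0;setosa\n6.2;2.9;versicolor\n",
        )
    return buf.getvalue()


delimiter, rows = sniff_and_load(make_demo_zip(), "demo.data")
print(repr(delimiter), rows[0])
```

A real importer would need to handle far more than delimiters (missing headers, fixed-width columns, multiple files per archive, non-tabular payloads), so this sketch covers only the simplest case.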
Taxonomy
Topics: Computational Physics and Python Applications
Methods: High-Order Consensuses
