A compendium of data sources for data science, machine learning, and artificial intelligence
Paul Bilokon, Oleksandr Bilokon, Saeed Amen

TL;DR
This paper provides a broad, though incomplete, compilation of diverse data sources across multiple domains to support data scientists and machine learning practitioners in accessing relevant data for their applications.
Contribution
It offers a comprehensive list of data sources across various fields, aiding practitioners in identifying relevant datasets for their projects.
Findings
Compiles data sources across multiple application domains.
Highlights the diversity and scope of available data for AI and ML.
Serves as a reference for data practitioners and researchers.
Abstract
Recent advances in data science, machine learning, and artificial intelligence, such as the emergence of large language models, are leading to an increasing demand for data that can be processed by such models. While data sources are application-specific, and it is impossible to produce an exhaustive list of such data sources, it seems that a comprehensive, rather than complete, list would still benefit data scientists and machine learning experts of all levels of seniority. The goal of this publication is to provide just such an (inevitably incomplete) list -- or compendium -- of data sources across multiple areas of application, including finance and economics, legal (laws and regulations), life sciences (medicine and drug discovery), news sentiment and social media, retail and ecommerce, satellite imagery, shipping and logistics, and sports.
Taxonomy
Topics: Data Quality and Management · Big Data Technologies and Applications
