SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration
Kan Ling, Zhen Qin, Yichi Zhu, Hengrun Zhang, Huiqun Yu, Guisheng Fan

TL;DR
SeDa is a comprehensive system that unifies dataset discovery, semantic annotation, and multi-entity navigation across millions of datasets, improving cross-source data exploration and reliability.
Contribution
SeDa introduces an integrated framework combining semantic extraction, tag graph construction, and multi-entity navigation for enhanced dataset discovery and trustworthiness.
Findings
Outperforms existing dataset search platforms in coverage and timeliness.
Ensures dataset source reliability through provenance validation.
Enables context-aware exploration beyond traditional search methods.
Abstract
The continuous expansion of open data platforms and research repositories has led to a fragmented dataset ecosystem, posing significant challenges for cross-source data discovery and interpretation. To address these challenges, we introduce SeDa--a unified framework for dataset discovery, semantic annotation, and multi-entity augmented navigation. SeDa integrates more than 7.6 million datasets from over 200 platforms, spanning governmental, academic, and industrial domains. The framework first performs semantic extraction and standardization to harmonize heterogeneous metadata representations. On this basis, a topic-tagging mechanism constructs an extensible tag graph that supports thematic retrieval and cross-domain association, while a provenance assurance module embedded within the annotation process continuously validates dataset sources and monitors link availability to ensure…
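As a rough illustration of the first two steps named in the abstract (metadata standardization, then tag graph construction), here is a minimal Python sketch. It is not SeDa's implementation: the field aliases, the common schema, the function names `standardize` and `build_tag_graph`, the sample records, and the choice of a simple co-occurrence graph are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alias table: maps source-specific keys onto an assumed common schema.
FIELD_ALIASES = {
    "title": ("title", "name", "dataset_name"),
    "description": ("description", "summary", "notes"),
    "tags": ("tags", "keywords", "subjects"),
    "url": ("url", "landing_page", "link"),
}


def standardize(record):
    """Map one heterogeneous metadata record onto the common schema."""
    out = {}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias present in the record; None if no alias matches.
        out[field] = next((record[k] for k in aliases if k in record), None)
    # Normalize tags to a sorted list of unique, lowercase, stripped strings.
    raw = out["tags"] or []
    if isinstance(raw, str):
        raw = raw.split(",")
    out["tags"] = sorted({t.strip().lower() for t in raw if t.strip()})
    return out


def build_tag_graph(records):
    """Undirected co-occurrence graph: tag -> {neighbor tag: shared dataset count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for a, b in combinations(rec["tags"], 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


# Hypothetical records from two sources with different metadata conventions.
raw_records = [
    {"name": "City Air Quality", "keywords": "Air, Environment",
     "link": "https://example.org/a"},
    {"title": "Emissions Inventory", "tags": ["environment", "climate"],
     "url": "https://example.org/b"},
]
records = [standardize(r) for r in raw_records]
tag_graph = build_tag_graph(records)
```

In this sketch, tags that appear together on a dataset become neighbors, so a reader could move from "air" to "environment" to "climate" across sources. The tag graph described in the paper may be built quite differently.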
Taxonomy
Topics: Scientific Computing and Data Management · Data Quality and Management · Research Data Management Practices
