A Methodology for Empirical Analysis of LOD Datasets
Vit Novacek

TL;DR
The paper introduces CoCoE, a comprehensive methodology using complexity, coherence, and entropy measures for empirical analysis of Linked Open Data datasets, demonstrated on biomedical RDF data.
Contribution
It presents a novel, extensible framework for evaluating RDF datasets' suitability for various knowledge discovery tasks.
Findings
CoCoE effectively differentiates datasets based on complexity and informativeness.
The methodology provides insights into dataset structure and suitability for specific applications.
Application to biomedical datasets demonstrates practical utility.
Abstract
CoCoE stands for Complexity, Coherence and Entropy, and presents an extensible methodology for empirical analysis of Linked Open Data (i.e., RDF graphs). CoCoE can offer answers to questions like: Is dataset A better than B for knowledge discovery since it is more complex and informative? Is dataset X better than Y for simple value lookups due to its flatter structure? In order to address such questions, we introduce a set of well-founded measures based on complementary notions from distributional semantics, network analysis and information theory. These measures are part of a specific implementation of the CoCoE methodology that is available for download. Last but not least, we illustrate CoCoE by its application to selected biomedical RDF datasets.
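As a rough illustration of what an information-theoretic measure over an RDF graph might look like, the sketch below computes the Shannon entropy of the predicate distribution in a list of triples. This is not one of the CoCoE measures, which the abstract does not define; the function name `predicate_entropy` and the toy triples are assumptions made purely for illustration.

```python
import math
from collections import Counter


def predicate_entropy(triples):
    """Shannon entropy (in bits) of the predicate distribution.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Hypothetical helper for illustration only; not part of CoCoE.
    """
    counts = Counter(p for _, p, _ in triples)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over predicates of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy data (invented): a "flat" graph with one predicate and a more
# varied graph with four distinct predicates.
flat = [
    ("a", "hasValue", "1"),
    ("b", "hasValue", "2"),
    ("c", "hasValue", "3"),
    ("d", "hasValue", "4"),
]
varied = [
    ("a", "treats", "b"),
    ("b", "causes", "c"),
    ("c", "interactsWith", "d"),
    ("d", "partOf", "a"),
]

flat_h = predicate_entropy(flat)
varied_h = predicate_entropy(varied)
```

Under this toy measure the single-predicate graph scores lower than the graph with four equally frequent predicates, which loosely mirrors the abstract's contrast between flat datasets suited to value lookups and more informative ones suited to knowledge discovery.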
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Semantic Web and Ontologies · Natural Language Processing Techniques
