Unstable markup: A template-based information extraction from web sites with unstable markup
Maxim Kolchin, Fedor Kozlov

TL;DR
This paper introduces a template-based web crawling method to extract structured data from unstable web markup, linking entities to Linked Open Data sources, demonstrated on CEUR Workshop proceedings.
Contribution
The work presents an extensible template-dependent crawler that effectively extracts and links entities from web pages with unstable markup, improving semantic data extraction.
Findings
Successfully converted CEUR proceedings to a LOD dataset
Linked extracted entities to DBpedia for semantic enrichment
Demonstrated robustness on pages with unstable markup
Abstract
This paper presents the results of work on converting the CEUR Workshop Proceedings web site into a Linked Open Data (LOD) dataset in the framework of the ESWC 2014 Semantic Publishing Challenge. Our approach is based on an extensible template-dependent crawler and on DBpedia for linking extracted entities, such as the names of universities and countries.
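As a rough illustration of the two ideas in the abstract, the sketch below tries an ordered list of templates against a page and keeps the first match, then builds a SPARQL query that could look up an extracted name in DBpedia by its label. This is not the authors' implementation: the regex templates, the `vol-title` class name, and the function names `extract_title` and `dbpedia_label_query` are all hypothetical, and the summary does not say how the crawler's templates or its linking queries are actually defined.

```python
import re
from typing import Optional

# Hypothetical templates, one per markup variant seen across pages.
# They are tried in order; the first one that matches wins.
TITLE_TEMPLATES = [
    re.compile(r'<span class="vol-title">(?P<title>.*?)</span>', re.S),
    re.compile(r"<h1>(?P<title>.*?)</h1>", re.S | re.I),
    re.compile(r"<title>(?P<title>.*?)</title>", re.S | re.I),
]


def extract_title(html: str) -> Optional[str]:
    """Return the title found by the first matching template, or None."""
    for template in TITLE_TEMPLATES:
        match = template.search(html)
        if match:
            # Collapse line breaks and repeated spaces inside the match.
            return " ".join(match.group("title").split())
    return None


def dbpedia_label_query(label: str) -> str:
    """Build (but do not send) a SPARQL query matching an English label."""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f'SELECT ?entity WHERE {{ ?entity rdfs:label "{escaped}"@en }} LIMIT 1'
    )


if __name__ == "__main__":
    page = "<html><h1>Example\n Workshop</h1></html>"
    print(extract_title(page))
    print(dbpedia_label_query("Russia"))
```

Supporting a new markup variant then means appending one template to the list rather than changing the extraction logic, which is one plausible reading of "extensible" here.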
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Service-Oriented Architecture and Web Services
