Unstable markup: A template-based information extraction from web sites with unstable markup

Maxim Kolchin; Fedor Kozlov

arXiv:1408.1260·cs.IR·August 7, 2014·1 cites

Open Access

TL;DR

This paper introduces a template-based web crawling method that extracts structured data from web pages with unstable markup and links the extracted entities to DBpedia, demonstrated on the CEUR Workshop proceedings web site.

Contribution

The work presents an extensible template-dependent crawler that extracts entities from web pages with unstable markup and links them to DBpedia.

Findings

01. Converted the CEUR Workshop proceedings web site into an LOD dataset
02. Linked extracted entities, such as universities and countries, to DBpedia
03. Demonstrated robustness on pages with unstable markup

Abstract

This paper presents the results of work on crawling the CEUR Workshop proceedings web site and converting it into a Linked Open Data (LOD) dataset, carried out within the ESWC 2014 Semantic Publishing Challenge. Our approach is based on an extensible template-dependent crawler and uses DBpedia to link extracted entities, such as the names of universities and countries.
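
To make the idea of a template-dependent crawler more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each known markup variant of a page can be described by one regular-expression template, and that templates are tried in order until one matches. The template names, the patterns, and the helper functions `extract_title` and `candidate_dbpedia_uri` are all hypothetical. The DBpedia step only builds a candidate resource URI from a name (spaces replaced by underscores); it does not check that the resource exists.

```python
import re
from typing import Optional
from urllib.parse import quote

# Hypothetical templates, one per markup variant of a proceedings page.
# These patterns are illustrative assumptions, not the real CEUR-WS markup.
TEMPLATES = [
    ("span-class", re.compile(r'<span class="title">(?P<title>[^<]+)</span>')),
    ("h1", re.compile(r"<h1>(?P<title>[^<]+)</h1>")),
]


def extract_title(html: str) -> Optional[dict]:
    """Try each template in order and return the first match, or None."""
    for name, pattern in TEMPLATES:
        match = pattern.search(html)
        if match:
            return {"template": name, "title": match.group("title").strip()}
    return None


def candidate_dbpedia_uri(entity_name: str) -> str:
    """Build a candidate DBpedia resource URI for a name (assumption:
    the name matches the resource label; no lookup is performed)."""
    local_name = quote(entity_name.strip().replace(" ", "_"))
    return "http://dbpedia.org/resource/" + local_name


if __name__ == "__main__":
    print(extract_title("<h1> Workshop A </h1>"))
    print(candidate_dbpedia_uri("ITMO University"))
```

Supporting a new markup variant in this sketch means appending one entry to `TEMPLATES`, which is one plausible reading of "extensible" here. A real linker would also need to confirm the candidate URI against DBpedia and handle ambiguous names.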

Taxonomy

Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Service-Oriented Architecture and Web Services