Inferring Missing Categorical Information in Noisy and Sparse Web Markup

Nicolas Tempelmeier; Elena Demidova; Stefan Dietze

arXiv:1803.00446·cs.LG·March 2, 2018

TL;DR

This paper presents a supervised method for inferring missing categorical properties in noisy, sparse Web markup. On properties of events and movies it reaches F1 scores of 79% and 83% respectively, significantly outperforming existing baselines.

Contribution

The paper introduces a supervised approach designed specifically for inferring missing categorical properties in Web markup, and it outperforms existing baseline methods.

Findings

01. Achieved a 79% F1 score on event properties.

02. Achieved an 83% F1 score on movie properties.

03. Significantly outperforms existing baseline methods.
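
For readers unfamiliar with the metric behind these findings, F1 is the harmonic mean of precision and recall. The sketch below shows the standard single-class computation only. The function name `f1_score` and the counts are hypothetical and not taken from the paper, and the summary here does not say how the paper aggregates scores across classes.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall, for one class."""
    predicted = true_positives + false_positives  # everything labeled positive
    actual = true_positives + false_negatives     # everything truly positive
    if predicted == 0 or actual == 0:
        # Precision or recall is undefined; report 0.0 by convention.
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; not results from the paper.
example = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot reach a high F1 by trading away recall for precision or the reverse.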

Abstract

Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages already in 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide less than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data…
