Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation
Eneko Agirre, Ander Barrena, Aitor Soroa

TL;DR
This paper investigates the Wikipedia hyperlink graph's structure and demonstrates that using the full graph with random walks significantly improves performance in word relatedness and entity disambiguation tasks, setting new state-of-the-art results.
Contribution
It provides a comprehensive analysis of Wikipedia links, showing the effectiveness of full graph usage over direct links and clarifying the limited role of categories and infoboxes in these tasks.
Findings
Full graph use outperforms direct links in relatedness and disambiguation
Non-reciprocal links negatively impact performance
Categories and infoboxes do not improve results
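To make the finding about non-reciprocal links concrete, here is a minimal sketch of filtering a link list so that only reciprocal links remain. The function name and the toy link list are hypothetical illustrations, not the authors' code or real Wikipedia data.

```python
def reciprocal_edges(links):
    """Keep only links (a, b) whose reverse link (b, a) is also present."""
    link_set = set(links)
    return {(a, b) for (a, b) in link_set if (b, a) in link_set}


# Hypothetical toy links between article titles (assumed data).
toy_links = [
    ("A", "B"), ("B", "A"),  # reciprocal pair, kept
    ("A", "C"),              # one-way link, dropped
    ("C", "D"), ("D", "C"),  # reciprocal pair, kept
]

kept = reciprocal_edges(toy_links)
```

The one-way link from A to C is discarded, while both directions of each reciprocal pair survive.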
Abstract
Hyperlinks and other relations in Wikipedia are an extraordinary resource which is still not fully understood. In this paper we study the different types of links in Wikipedia, and contrast the use of the full graph with the use of just direct links. We apply a well-known random walk algorithm to two tasks, word relatedness and named-entity disambiguation. We show that using the full graph is more effective than just direct links by a large margin, that non-reciprocal links harm performance, and that there is no benefit from categories and infoboxes, with coherent results on both tasks. We set new state-of-the-art figures for systems based on Wikipedia links, comparable to systems exploiting several information sources and/or supervised machine learning. Our approach is open source, with instructions to reproduce results, and amenable to integration with complementary text-based methods.
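The abstract does not name the random walk algorithm, so the following is only a sketch under an assumption: that relatedness is computed by running Personalized PageRank from each word's article and comparing the resulting probability vectors with cosine similarity. The toy graph, the function names, and the damping value of 0.85 are all hypothetical choices for illustration, not the authors' implementation.

```python
import math


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation restricted to `seeds`.

    `graph` maps every node to the list of nodes it links to.
    """
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Dangling node: send its probability mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank


def cosine(u, v):
    """Cosine similarity of two vectors stored as dicts over the same keys."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def relatedness(graph, a, b):
    """Compare the random-walk distributions started from nodes a and b."""
    return cosine(
        personalized_pagerank(graph, {a}),
        personalized_pagerank(graph, {b}),
    )


# Hypothetical toy hyperlink graph (assumed data, not real Wikipedia links).
toy_graph = {
    "cat": ["pet", "animal"],
    "dog": ["pet", "animal"],
    "pet": ["cat", "dog"],
    "animal": ["cat", "dog"],
    "car": ["engine"],
    "engine": ["car"],
}

cat_dog = relatedness(toy_graph, "cat", "dog")
cat_car = relatedness(toy_graph, "cat", "car")
```

In this toy graph "cat" and "dog" have no direct link, yet the walk reaches both through "pet" and "animal", so they score higher than "cat" and "car", which lie in disconnected components. This is the kind of signal that direct links alone would miss.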
Taxonomy
Topics: Wikis in Education and Collaboration · Topic Modeling · Natural Language Processing Techniques
