DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections
Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, Ilya Eckstein

TL;DR
DOCENT introduces a self-supervised approach to learn comprehensive entity representations from large, multi-source text collections, enhancing various entity-centric tasks without human supervision.
Contribution
The paper proposes training strategies that jointly predict words and entities, scale to large corpora, and match or outperform baselines on downstream tasks.
Findings
Models match or outperform baselines in downstream tasks.
Effective learning from large, multi-source text data.
Scalable to very large corpora.
Abstract
This paper explores learning rich self-supervised entity representations from large amounts of the associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision. We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities -- strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as…
