Metadata Shaping: Natural Language Annotations for the Tail

Simran Arora; Sen Wu; Enci Liu; Christopher Re

arXiv:2110.08430 · cs.CL · October 19, 2021

TL;DR

Metadata shaping enhances language models' ability to understand rare entities by appending metadata to training examples, achieving significant improvements without altering the model architecture.

Contribution

The paper introduces metadata shaping, a data-centric method that improves LM performance on rare entities by leveraging metadata, matching or surpassing existing architecture-based approaches.

Findings

01 Metadata shaping improves F1 scores by up to 5.3 points over a BERT baseline.

02 Gains are up to 10x larger on examples containing tail entities than on examples containing popular entities.

03 It achieves or exceeds state-of-the-art results on standard tasks.

Abstract

Language models (LMs) have made remarkable progress, but still struggle to generalize beyond the training data to rare linguistic patterns. Since rare entities and facts are prevalent in the queries users submit to popular applications such as search and personal assistant systems, improving the ability of LMs to reliably capture knowledge over rare entities is a pressing challenge studied in significant prior work. Noticing that existing approaches primarily modify the LM architecture or introduce auxiliary objectives to inject useful entity knowledge, we ask to what extent we can match the quality of these architectures using a base LM architecture and changing only the data. We propose metadata shaping, a method in which readily available metadata, such as entity descriptions and categorical tags, are appended to examples based on information-theoretic metrics. Intuitively, if…
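The idea of appending metadata chosen by an information-theoretic score can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: pointwise mutual information (PMI) between a metadata token and a class label stands in for the unspecified "information-theoretic metrics", and the names `pmi_scores`, `shape_example`, the `[SEP]` separator, and the toy data (including the entity "Daniella Ruiz") are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(examples):
    """Estimate PMI(token, label) from co-occurrence counts in labeled examples.

    Assumption: PMI is used here as one possible information-theoretic metric.
    """
    n = len(examples)
    token_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for ex in examples:
        label_counts[ex["label"]] += 1
        for tok in set(ex["metadata"]):
            token_counts[tok] += 1
            joint_counts[(tok, ex["label"])] += 1
    return {
        (tok, label): math.log(
            (count / n) / ((token_counts[tok] / n) * (label_counts[label] / n))
        )
        for (tok, label), count in joint_counts.items()
    }


def shape_example(example, scores, k=2, sep=" [SEP] "):
    """Append the k metadata tokens whose best PMI over any label is highest.

    The label of `example` is not consulted, so this also works at test time.
    """
    def best(tok):
        return max(
            (s for (t, _), s in scores.items() if t == tok),
            default=float("-inf"),  # token never seen in training
        )

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(set(example["metadata"]), key=lambda t: (-best(t), t))
    chosen = ranked[:k]
    if not chosen:
        return example["text"]
    return example["text"] + sep + " ".join(chosen)


# Hypothetical toy data: "entity" co-occurs with every label, so it is uninformative.
train = [
    {"text": "Lincoln signed the bill.", "label": "person", "metadata": ["politician", "entity"]},
    {"text": "Merkel gave a speech.", "label": "person", "metadata": ["politician", "entity"]},
    {"text": "Lincoln is in Nebraska.", "label": "place", "metadata": ["city", "entity"]},
    {"text": "Omaha hosted the fair.", "label": "place", "metadata": ["city", "entity"]},
]
scores = pmi_scores(train)
rare = {"text": "Daniella Ruiz signed the bill.", "metadata": ["entity", "politician"]}
print(shape_example(rare, scores, k=1))
```

In this toy setup the rare entity shares the tag "politician" with popular entities, so the appended tag gives an unmodified LM a signal it has already seen during training, while the uninformative tag "entity" is ranked last and dropped.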

Code & Models

Repositories

simran-arora/metadatashaping
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems