PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search

Thang M. Pham; Seunghyun Yoon; Trung Bui; Anh Nguyen

arXiv:2207.09068 · cs.CL · February 3, 2023 · 1 citation

TL;DR

This paper introduces PiC, a large dataset for phrase understanding in context, enabling improved training and evaluation of phrase embeddings and semantic search models.

Contribution

It provides a human-annotated dataset of ~28K noun phrases with their Wikipedia context, plus a suite of three tasks for training and evaluating contextual phrase embeddings.

Findings

01

Training on PiC enhances ranking model accuracy.

02

Span-selection models achieve near-human accuracy (~95% EM).

03

Models struggle to distinguish phrase senses and measure phrase similarity in context.

Abstract

While contextualized word embeddings have become a de facto standard, learning contextualized phrase embeddings is less explored and is hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (instead of phrases alone). To fill this gap, we propose PiC, a dataset of ~28K noun phrases accompanied by their contextual Wikipedia pages, together with a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking models' accuracy and pushes span-selection (SS) models (i.e., models that predict the start and end index of the target phrase) to near-human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that such impressive performance is because the SS models learn to better capture the common…
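The abstract describes span selection as predicting the start and end index of the target phrase, scored by Exact Match (EM). A minimal sketch of both ideas follows. It is not the authors' evaluation code: the function names (`best_span`, `normalize`, `exact_match`), the `max_len` cap, the toy scores, and the lowercase/whitespace normalization are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def best_span(
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    max_len: int = 8,
) -> Tuple[int, int]:
    """Return the inclusive (start, end) token indices with the highest
    start_score + end_score, subject to start <= end and a span of at
    most max_len tokens. max_len is a hypothetical cap, not a value
    taken from the paper."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider end positions at or after the start position.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Toy passage and made-up per-token scores, for illustration only.
tokens = ["the", "bank", "raised", "interest", "rates", "again"]
start = [0.1, 0.2, 0.0, 2.5, 0.3, 0.1]
end = [0.0, 0.1, 0.2, 0.4, 3.0, 0.2]

i, j = best_span(start, end)
predicted_phrase = " ".join(tokens[i : j + 1])
```

In this toy case the highest-scoring span covers tokens 3 to 4, so `predicted_phrase` is "interest rates". EM gives no partial credit: a prediction that overlaps the reference but differs by one token counts as a miss.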

Code & Models

Repositories

Phrase-in-Context/eval (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Attention Dropout · Layer Normalization · Weight Decay · Linear Warmup With Linear Decay · Dense Connections · Softmax