Query-Based Keyphrase Extraction from Long Documents

Martin Docekal; Pavel Smrz

arXiv:2205.05391·cs.CL·May 12, 2022

TL;DR

This paper introduces a query-based method for extracting keyphrases from long documents. Documents are split into chunks, and a query carrying the global context defines the topic for which keyphrases are extracted, which works around the input size limits of Transformer models.

Contribution

The paper combines chunking with a query that supplies global context, so that a BERT-based extractor can handle lengthy texts.

Findings

01 A shorter context with a query outperforms a longer context without a query on long documents.

02 The method was evaluated on Inspec, SemEval, and a new large dataset.

03 The system adapts a pre-trained BERT model to estimate the probability that a given text span forms a keyphrase.

Abstract

Transformer-based architectures in natural language processing impose input size limits that can be problematic when long documents need to be processed. This paper overcomes this issue for keyphrase extraction by chunking the long documents while keeping a global context as a query defining the topic for which relevant keyphrases should be extracted. The developed system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase. We experimented using various context sizes on two popular datasets, Inspec and SemEval, and a large novel dataset. The presented results show that a shorter context with a query outperforms a longer one without the query on long documents.
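
The chunking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name chunk_with_query, the whitespace tokenization, the max_len and overlap parameters, and the choice of query text are all assumptions made for illustration. Each chunk pairs the query with one window of the document, in the usual BERT sentence-pair layout, so that every chunk fits a fixed input size while still carrying the global topic.

```python
from typing import List


def chunk_with_query(query: str, document: str,
                     max_len: int = 16, overlap: int = 0) -> List[List[str]]:
    """Split a document into windows and prepend the same query to each one.

    Hypothetical helper: whitespace splitting stands in for a real
    subword tokenizer such as WordPiece.
    """
    query_tokens = query.split()
    doc_tokens = document.split()

    # Three slots are reserved for the [CLS] marker and the two [SEP] markers.
    budget = max_len - len(query_tokens) - 3
    if budget <= 0:
        raise ValueError("max_len is too small for the query")
    if not 0 <= overlap < budget:
        raise ValueError("overlap must satisfy 0 <= overlap < budget")

    step = budget - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(doc_tokens), step):
        window = doc_tokens[start:start + budget]
        chunks.append(["[CLS]", *query_tokens, "[SEP]", *window, "[SEP]"])
        # Stop once the current window has reached the end of the document.
        if start + budget >= len(doc_tokens):
            break
    return chunks


if __name__ == "__main__":
    example = chunk_with_query("keyphrase extraction",
                               "a b c d e f g h i j", max_len=9)
    for chunk in example:
        print(" ".join(chunk))
```

In this sketch the query could be, for example, the document title. A span classifier would then score candidate spans inside each window separately, with the query tokens serving only as context.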

Code & Models

Repositories

KNOT-FIT-BUT/QBEK (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Weight Decay · Dropout · WordPiece · Layer Normalization · Softmax · Attention Dropout · Linear Warmup With Linear Decay