Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer
Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk

TL;DR
This paper evaluates the plT5 model for keyword extraction from short texts, demonstrating its effectiveness on scientific abstracts and cross-domain texts, and introduces a new Polish scientific metadata corpus.
Contribution
It presents a new Polish scientific metadata corpus and demonstrates the effectiveness of the plT5 model for keyword extraction across multiple domains.
Findings
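plT5kw yields particularly promising keyword extraction results compared with extremeText, TermoPL, and KeyBERT, for both frequent and sparsely represented keywords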
The new POSMAC corpus facilitates evaluation of keyword extraction models
plT5kw shows promise in cross-domain text labelling scenarios
Abstract
The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), which is released with this paper: a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project. We compare the results obtained by four different methods, i.e. plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts which…
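The text-to-text framing mentioned in the abstract can be illustrated with a minimal sketch: an abstract becomes the source string, and its keywords become a single target string that a sequence-to-sequence model would learn to generate. The function names, the comma separator, and the deduplication step are assumptions made for illustration only; this summary does not state the exact input and output format used for plT5kw.

```python
from typing import Iterable, List, Tuple


def to_text_pair(abstract: str, keywords: Iterable[str], sep: str = ", ") -> Tuple[str, str]:
    """Build a (source, target) training pair for a text-to-text model.

    Hypothetical format: the target is the keyword list joined by `sep`.
    """
    # Collapse runs of whitespace (including newlines) in the abstract.
    source = " ".join(abstract.split())
    # Drop empty keywords and trim surrounding whitespace.
    target = sep.join(k.strip() for k in keywords if k.strip())
    return source, target


def parse_keywords(generated: str, sep: str = ",") -> List[str]:
    """Split a generated target string back into a list of keywords.

    Duplicates are removed case-insensitively; first occurrences are kept
    in their original order (an assumption, not the paper's procedure).
    """
    seen = set()
    result = []
    for part in generated.split(sep):
        keyword = " ".join(part.split())
        key = keyword.casefold()
        if keyword and key not in seen:
            seen.add(key)
            result.append(keyword)
    return result


if __name__ == "__main__":
    src, tgt = to_text_pair(
        "  Keyword   extraction\nwith T5. ",
        ["keyword extraction", " T5 ", ""],
    )
    print(src)
    print(tgt)
    print(parse_keywords("keyword extraction, T5, t5, ,  language   model"))
```

Because keywords are generated as free text rather than selected from the input, this framing would in principle cover both keywords that appear in the passage and keywords that do not.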
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques
Methods: Linear Layer · Byte Pair Encoding · Attention Is All You Need · Softmax · Dropout · Dense Connections · Residual Connection · Multi-Head Attention · Absolute Position Encodings · Position-Wise Feed-Forward Layer
