Punctuation Prediction for Polish Texts using Transformers

Jakub Pokrywka

arXiv:2410.04621·cs.CL·October 8, 2024

TL;DR

This paper presents a punctuation prediction method for Polish texts using a fine-tuned HerBERT model, improving text readability by restoring punctuation in speech recognition outputs.

Contribution

The paper introduces a HerBERT-based approach combined with external data for punctuation prediction in Polish, achieving competitive results in Poleval 2022.

Findings

01

Achieved a Weighted F1 score of 71.44 on Poleval 2022 Task 1

02

Utilized a single HerBERT model with external data

03

Demonstrated effectiveness for Polish punctuation prediction
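The headline metric above is a Weighted F1 score. As a rough illustration of how such a score can be computed, the sketch below averages per-class F1 over punctuation labels, weighting each class by its support in the gold data. The function name `weighted_f1`, the "O" label for "no punctuation", and the choice to exclude that class from the average are assumptions made for this example; the exact Poleval 2022 scoring script is not described here and may differ.

```python
from collections import Counter


def weighted_f1(gold, pred, ignore_label="O"):
    """Support-weighted average of per-class F1 over punctuation labels.

    `gold` and `pred` are equal-length lists with one label per word.
    `ignore_label` (an assumption of this sketch) marks "no punctuation"
    and is left out of the average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Support = number of gold occurrences of each scored label.
    support = Counter(g for g in gold if g != ignore_label)
    total = sum(support.values())
    if total == 0:
        return 0.0

    score = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Each class contributes in proportion to its share of gold marks.
        score += f1 * count / total
    return score


if __name__ == "__main__":
    gold = [",", "O", ".", ",", "O", "."]
    pred = [",", "O", ".", "O", ",", "."]
    print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means frequent marks such as commas and periods dominate the score, while rare marks contribute proportionally less.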

Abstract

Speech recognition systems typically output text that lacks punctuation, yet punctuation is crucial for comprehending written text. Punctuation prediction models are developed to address this problem. This paper describes a solution for Poleval 2022 Task 1: Punctuation Prediction for Polish Texts, which scores 71.44 Weighted F1. The method uses a single HerBERT model fine-tuned on the competition data and an external dataset.
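Punctuation prediction is commonly framed as labelling each word with the punctuation mark that should follow it. The sketch below shows that framing on plain text: it strips punctuation to imitate speech recognition output and then restores it from the labels. The mark inventory in `MARKS`, the "O" label, and both function names are hypothetical choices for this illustration, not the paper's preprocessing, and capitalization is not restored.

```python
MARKS = ".,?!:;-"  # assumed mark inventory, for illustration only


def to_token_labels(text):
    """Turn punctuated text into (lowercased word, following-mark) pairs."""
    pairs = []
    for token in text.split():
        word = token.rstrip(MARKS)
        mark = token[len(word):]
        if not word:
            # Token made only of punctuation (e.g. a lone dash);
            # this simple sketch drops it.
            continue
        pairs.append((word.lower(), mark if mark else "O"))
    return pairs


def restore(words, labels):
    """Re-attach predicted marks to unpunctuated words."""
    out = []
    for word, label in zip(words, labels):
        out.append(word if label == "O" else word + label)
    return " ".join(out)


if __name__ == "__main__":
    pairs = to_token_labels("Dzień dobry, jak się masz?")
    words = [w for w, _ in pairs]
    labels = [l for _, l in pairs]
    print(pairs)
    print(restore(words, labels))
```

With data in this shape, a pretrained encoder such as HerBERT can be fine-tuned as a per-word classifier over the label set, although the paper's exact setup is not given in this summary.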
