Icelandic Parallel Abstracts Corpus

Haukur Barri Símonarson; Vésteinn Snæbjarnarson

arXiv:2108.05289 · cs.CL · August 12, 2021 · 1 citation

TL;DR

This paper introduces the Icelandic Parallel Abstracts Corpus (IPAC), a new Icelandic-English dataset of about 64,000 sentence pairs drawn from over 6,000 parallel abstracts of student theses and dissertations, created for translation and NLP research.

Contribution

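The paper presents a new Icelandic-English parallel corpus derived from academic abstracts in the Skemman repository. Sentences were aligned with Bleualign, using sentence-level BLEU scores computed on NMT output in both translation directions, and the result is suitable for machine translation tasks.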

Findings

01. Corpus contains 64,000 sentence pairs
02. Aligned using BLEU-based sentence alignment
03. Applicable for translation and NLP research

Abstract

We present a new Icelandic-English parallel corpus, the Icelandic Parallel Abstracts Corpus (IPAC), composed of abstracts from student theses and dissertations. The texts were collected from the Skemman repository, which keeps records of all theses, dissertations and final projects from students at Icelandic universities. The corpus was aligned based on sentence-level BLEU scores, in both translation directions, from NMT models using Bleualign. The result is a corpus of 64k sentence pairs from over 6 thousand parallel abstracts.
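
The alignment idea in the abstract can be illustrated with a minimal sketch. This is not Bleualign's implementation and not the authors' pipeline: it assumes whitespace tokenization, a simplified smoothed BLEU up to bigrams, strictly 1-to-1 monotonic alignment, and a hypothetical score threshold of 0.3. The function names (`sentence_bleu`, `align`, `intersect_directions`) and the toy sentences are invented for illustration, and combining the two translation directions by intersection is likewise an assumption, since the abstract does not say how the directions were combined.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=2):
    """Simplified smoothed sentence-level BLEU (illustrative, not the standard metric)."""
    h, r = hyp.lower().split(), ref.lower().split()
    if not h or not r:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngr, r_ngr = ngrams(h, n), ngrams(r, n)
        overlap = sum((h_ngr & r_ngr).values())  # clipped n-gram matches
        total = max(sum(h_ngr.values()), 1)
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            p = overlap / total
        else:
            p = (overlap + 1) / (total + 1)  # add-one smoothing for n > 1
        log_prec += math.log(p) / max_n
    brevity = min(1.0, math.exp(1 - len(r) / len(h)))
    return brevity * math.exp(log_prec)


def align(translated_src, tgt, threshold=0.3):
    """Monotonic 1-to-1 alignment maximizing total score; returns (src_idx, tgt_idx) pairs."""
    n, m = len(translated_src), len(tgt)
    score = [[sentence_bleu(s, t) for t in tgt] for s in translated_src]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score[i - 1][j - 1]
            match = best[i - 1][j - 1] + (s if s >= threshold else 0.0)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], match)
    # Backtrack through the table to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def intersect_directions(pairs_fwd, pairs_bwd):
    """Keep (src_idx, tgt_idx) pairs found in both directions.

    pairs_bwd comes from the reverse direction, so it holds (tgt_idx, src_idx).
    """
    bwd = {(s, t) for t, s in pairs_bwd}
    return [p for p in pairs_fwd if p in bwd]


# Toy data, invented for illustration: English MT output for three source
# sentences, and two sentences from a human-written English abstract.
mt_of_source = [
    "this thesis studies machine translation for icelandic",
    "the data was collected from a repository",
    "keywords translation corpus",
]
english_abstract = [
    "this thesis examines machine translation for icelandic",
    "the data were gathered from a public repository",
]
pairs = align(mt_of_source, english_abstract)
print(pairs)  # [(0, 0), (1, 1)]
```

In this toy run the third machine-translated sentence has no counterpart above the threshold, so it is left unaligned. A real aligner such as Bleualign also handles cases this sketch ignores, for example one sentence corresponding to several on the other side.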

Code & Models

Datasets

vesteinn/icelandic-parallel-abstracts-corpus-IPAC
dataset · 20 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems