Icelandic Parallel Abstracts Corpus
Haukur Barri Símonarson, Vésteinn Snæbjarnarson

TL;DR
This paper introduces the Icelandic Parallel Abstracts Corpus (IPAC), a new bilingual dataset of 64,000 sentence pairs from Icelandic and English academic abstracts, created for translation and NLP research.
Contribution
The paper presents a new Icelandic-English parallel corpus of 64k sentence pairs, derived from the abstracts of student theses and dissertations, aligned with Bleualign using sentence-level BLEU scores from NMT models, and suitable for machine translation tasks.
Findings
Corpus contains 64,000 sentence pairs
Sentences aligned with Bleualign using sentence-level BLEU scores in both translation directions
Applicable for translation and NLP research
Abstract
We present a new Icelandic-English parallel corpus, the Icelandic Parallel Abstracts Corpus (IPAC), composed of abstracts from student theses and dissertations. The texts were collected from the Skemman repository, which keeps records of all theses, dissertations and final projects from students at Icelandic universities. The corpus was aligned based on sentence-level BLEU scores, in both translation directions, from NMT models using Bleualign. The result is a corpus of 64k sentence pairs from over 6 thousand parallel abstracts.
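The alignment idea described in the abstract can be illustrated with a minimal sketch. This is not Bleualign's actual implementation: the simplified smoothed BLEU, the 0.2 threshold, the one-to-one monotonic dynamic program, and the function names (`sentence_bleu`, `align`, `intersect_directions`) are all assumptions made for illustration. Machine-translated source sentences are scored against the target-side sentences, and the best-scoring monotonic path is kept. Intersecting the two translation directions is shown as one plausible way to combine them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=2):
    """Simplified sentence-level BLEU (assumption: add-one smoothing for n > 1)."""
    hyp_tok, ref_tok = hyp.lower().split(), ref.lower().split()
    if not hyp_tok or not ref_tok:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_tok, n), ngrams(ref_tok, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        total = sum(h.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words: treat as unrelated
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_sum += math.log(precision) / max_n
    brevity = min(1.0, math.exp(1 - len(ref_tok) / len(hyp_tok)))
    return brevity * math.exp(log_sum)


def align(translated_src, tgt, threshold=0.2):
    """Monotonic 1-to-1 alignment that maximises the summed BLEU scores."""
    n, m = len(translated_src), len(tgt)
    score = [[sentence_bleu(translated_src[i], tgt[j]) for j in range(m)]
             for i in range(n)]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score[i - 1][j - 1]
            match = best[i - 1][j - 1] + s if s >= threshold else -1.0
            best[i][j] = max(best[i - 1][j], best[i][j - 1], match)
    # Walk back through the table to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def intersect_directions(forward, backward):
    """Keep (src, tgt) pairs found in both directions.

    `backward` holds (tgt, src) pairs from aligning in the opposite direction.
    """
    flipped = {(s, t) for t, s in backward}
    return [p for p in forward if p in flipped]


# Hypothetical data: English MT output of three Icelandic sentences,
# and four sentences from the English abstract (one has no counterpart).
mt_output = [
    "this thesis studies machine translation",
    "we collected abstracts from a repository",
    "results show improved quality",
]
english = [
    "this thesis examines machine translation",
    "acknowledgements go here",
    "we collected abstracts from the repository",
    "the results show improved translation quality",
]
pairs = align(mt_output, english)
print(pairs)
```

In this toy example the unmatched English sentence is skipped, because no machine-translated sentence shares any words with it. A real pipeline would also need to handle one-to-many alignments and tokenization, which this sketch leaves out.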
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
