SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
Johan Sofalas, Dilushri Pavithra, Nevidu Jayatilleke, Ruvan Weerasinghe

TL;DR
This paper introduces SinFoS, a new dataset of Sinhala figures of speech with annotations, to improve translation and understanding of culturally rooted expressions in low-resource NLP contexts.
Contribution
The paper presents a novel dataset of 2,344 Sinhala figures of speech (FoS) with cultural and cross-lingual annotations, a binary classifier that distinguishes between two FoS types, and an evaluation of existing LLMs, addressing low-resource and cultural challenges in machine translation.
Findings
A binary classifier achieved approximately 92% accuracy in distinguishing between two types of FoS in the dataset.
Current LLMs struggle to translate idiomatic expressions accurately.
The dataset serves as a benchmark for low-resource, culturally aware NLP research.
Abstract
Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FoS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs,…
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Multimodal Machine Learning Applications
