Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek
John Pavlopoulos, Juli Bakagianni, Kanella Pouli, Maria Gavriilidou

TL;DR
This paper evaluates open-source and closed-source large language models on Greek NLP tasks, uses authorship attribution to probe whether LLMs may have seen the data during pre-training, and presents a legal NLP case study built on a Summarize, Translate, and Embed (STE) methodology, advancing NLP for lesser-resourced languages.
Contribution
It evaluates LLMs on seven core Greek NLP tasks, reframes authorship attribution as a way to assess data provenance, and presents a Summarize, Translate, and Embed (STE) approach to legal text clustering that outperforms TF-IDF.
Findings
Open-source and closed-source LLMs show task-specific strengths and weaknesses.
High zero-shot accuracy in authorship attribution suggests that the LLMs may have seen the texts during pre-training.
The Summarize, Translate, and Embed (STE) methodology outperforms TF-IDF in legal text clustering.
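The authorship attribution finding rests on comparing zero-shot accuracy with what guessing would achieve. The following is a minimal sketch of that comparison, not the paper's procedure: the function names, the uniform-guessing baseline, and the exact binomial tail are all assumptions made for illustration.

```python
from math import comb
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of texts whose predicted author matches the gold author."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def chance_tail_probability(n_correct: int, n_total: int, n_authors: int) -> float:
    """Probability of at least n_correct hits when guessing uniformly
    among n_authors candidates (exact binomial tail)."""
    p = 1.0 / n_authors
    return sum(
        comb(n_total, k) * p**k * (1.0 - p) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Hypothetical example: 4 texts, 3 candidate authors, 3 correct predictions.
predicted = ["A", "B", "A", "C"]
gold = ["A", "B", "B", "C"]
acc = accuracy(predicted, gold)
tail = chance_tail_probability(3, len(gold), n_authors=3)
```

A small tail probability only says that the accuracy is unlikely under guessing. It does not by itself show that the texts were in the pre-training data, since distinctive style can also make authors easy to tell apart.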
Abstract
Natural Language Processing (NLP) for lesser-resourced languages faces persistent challenges, including limited datasets, inherited biases from high-resource languages, and the need for domain-specific solutions. This study addresses these gaps for Modern Greek through three key contributions. First, we evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models (LLMs) on seven core NLP tasks for which datasets are available, revealing task-specific strengths, weaknesses, and parity in their performance. Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training, with high zero-shot accuracy suggesting ethical implications for data provenance. Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional…
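The STE idea can be sketched as a composition of three steps. This is an illustrative sketch only: the step order is inferred from the method's name, and `toy_summarize`, `toy_translate`, `toy_embed`, and `TOY_LEXICON` are hypothetical stand-ins for a real summarizer, translator, and embedding model, none of which are specified here.

```python
from collections import Counter
from math import sqrt
from typing import Callable, Dict

Vector = Dict[str, float]


def ste_embed(
    doc: str,
    summarize: Callable[[str], str],
    translate: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> Vector:
    """Summarize the document, translate the summary, then embed it."""
    return embed(translate(summarize(doc)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical stand-ins; a real pipeline would call actual models here.
TOY_LEXICON = {"νόμος": "law", "φόρος": "tax", "δικαστήριο": "court"}


def toy_summarize(text: str) -> str:
    return " ".join(text.split()[:8])  # keep the first 8 tokens


def toy_translate(text: str) -> str:
    return " ".join(TOY_LEXICON.get(w, w) for w in text.lower().split())


def toy_embed(text: str) -> Vector:
    return dict(Counter(text.split()))  # bag-of-words counts


vec_a = ste_embed("Νόμος φόρος δικαστήριο", toy_summarize, toy_translate, toy_embed)
vec_b = ste_embed("φόρος νόμος", toy_summarize, toy_translate, toy_embed)
similarity = cosine(vec_a, vec_b)
```

Passing the three steps in as functions keeps the sketch independent of any particular model. The resulting vectors could then be fed to any clustering algorithm and compared against a TF-IDF baseline.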
Taxonomy
Topics: Natural Language Processing Techniques · Library Science and Information Systems
