The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching

Yevhen Kostiuk; Oxana Vitman; Łukasz Gagała; Artur Kiulian

arXiv:2501.09164·cs.CL·January 17, 2025

Open Access

TL;DR

This study evaluates how accurately various large language models perform short answer matching in Latvian and Lithuanian. Larger models excel, while smaller models vary in performance, especially when the differences between texts are subtle.

Contribution

Introduces new datasets for Latvian and Lithuanian short answer matching and benchmarks multiple LLMs, highlighting their strengths and limitations in subtle text difference detection.

Findings

01. Larger LLMs like QWEN2.5 72b perform near-perfectly in answer matching.

02. Smaller models show variable performance, improved with few-shot learning.

03. Mistral Nemo 12b underperforms in detecting subtle text alterations.

Abstract

In this work, we address the challenge of evaluating large language models (LLMs) on the short answer matching task for Latvian and Lithuanian languages. We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian question-answer pairs. For each question-answer pair, we generated matched and non-matched answers using a set of alteration rules specifically designed to introduce small but meaningful changes in the text. These generated answers serve as test cases to assess the ability of LLMs to detect subtle differences in matching of the original answers. A subset of the datasets was manually verified for quality and accuracy. Our results show that while larger LLMs, such as QWEN2.5 72b and LLaMa3.1 70b, demonstrate near-perfect performance in distinguishing matched and non-matched answers, smaller models show more variance. For instance, LLaMa3.1 8b and EuroLLM 9b…
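The evaluation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the two alteration rules (`add_trailing_period`, `change_first_digit`), the `MatchCase` record, and the `exact_match_judge` baseline are all hypothetical names chosen for this example, and the paper's actual rules are not listed in the abstract. The sample question is likewise only illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MatchCase:
    """One test case: a candidate answer and whether it should match the reference."""
    question: str
    reference: str
    candidate: str
    is_match: bool


def add_trailing_period(answer: str) -> str:
    """Hypothetical 'matched' rule: a surface change that keeps the meaning."""
    return answer if answer.endswith(".") else answer + "."


def change_first_digit(answer: str) -> str:
    """Hypothetical 'non-matched' rule: bump the first ASCII digit by one."""
    for i, ch in enumerate(answer):
        if ch in "0123456789":
            return answer[:i] + str((int(ch) + 1) % 10) + answer[i + 1:]
    return answer  # no digit found, rule does not apply


def build_cases(question: str, reference: str) -> List[MatchCase]:
    """Generate one matched case and, if the rule applies, one non-matched case."""
    cases = [MatchCase(question, reference, add_trailing_period(reference), True)]
    altered = change_first_digit(reference)
    if altered != reference:
        cases.append(MatchCase(question, reference, altered, False))
    return cases


def accuracy(cases: List[MatchCase], judge: Callable[[MatchCase], bool]) -> float:
    """Fraction of cases where the judge's verdict equals the gold label."""
    return sum(judge(c) == c.is_match for c in cases) / len(cases)


def exact_match_judge(case: MatchCase) -> bool:
    """Stand-in for an LLM judge: compare answers after trimming a final period."""
    return case.candidate.strip().rstrip(".") == case.reference.strip().rstrip(".")


if __name__ == "__main__":
    # Illustrative Latvian question: "In which year did Latvia declare independence?"
    cases = build_cases("Kurā gadā Latvija pasludināja neatkarību?", "1918")
    print(accuracy(cases, exact_match_judge))
```

To evaluate an actual model, `exact_match_judge` would be replaced by a function that prompts the LLM with the question, the reference answer, and the candidate, then parses its verdict into a boolean. Keeping the judge as a plain callable lets the same `accuracy` function score any model.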

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training