Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark

Fabio Mercorio; Mario Mezzanzanica; Daniele Potertì; Antonio Serino; Andrea Seveso

arXiv:2406.17535·cs.CL·June 26, 2024·1 cites

Open Access

TL;DR

This paper introduces a new benchmark using the INVALSI tests to evaluate the proficiency of large language models in Italian, providing a standardized way to assess and compare their performance against human results.

Contribution

The paper adapts the INVALSI tests for automated evaluation of LLMs, assesses current models in detail, and visually compares their performance with human results.

Findings

1. LLMs show varying proficiency on the INVALSI benchmark.

2. The benchmark provides a new standardized evaluation method for Italian language models.

3. The comparison highlights gaps between LLM and human performance.

Abstract

Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: Firstly, we adapt the INVALSI benchmark for automated LLM evaluation, which involves rigorous adaptation of the test format to suit automated processing while retaining the essence of the original tests. Secondly, we provide a detailed assessment of…
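To make the idea of adapting a test for automated evaluation concrete, the following is a minimal sketch of how multiple-choice items could be scored automatically. It is not the authors' pipeline: the `Item` structure, the prompt wording, and the letter-extraction rule are all assumptions made for illustration, and no INVALSI data or human reference scores are included.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """One multiple-choice question (hypothetical structure, not the paper's schema)."""
    question: str
    options: dict  # option letter -> option text, e.g. {"A": "...", "B": "..."}
    answer: str    # gold option letter


def build_prompt(item: Item) -> str:
    """Format an item as a plain-text prompt asking for a single option letter."""
    lines = [item.question]
    lines += [f"{letter}. {text}" for letter, text in sorted(item.options.items())]
    lines.append("Answer with the letter only.")
    return "\n".join(lines)


def extract_choice(reply: str, valid: set) -> Optional[str]:
    """Return the first standalone valid option letter in the reply, else None.

    Assumption: replies are short. A one-letter word such as "a" in a longer
    sentence would be misread as option A, so real use needs a stricter rule.
    """
    for match in re.finditer(r"\b([A-Z])\b", reply.upper()):
        if match.group(1) in valid:
            return match.group(1)
    return None


def accuracy(items: list, replies: list) -> float:
    """Fraction of items whose extracted letter equals the gold letter."""
    if len(items) != len(replies):
        raise ValueError("items and replies must have the same length")
    if not items:
        return 0.0
    correct = sum(
        extract_choice(reply, set(item.options)) == item.answer
        for item, reply in zip(items, replies)
    )
    return correct / len(items)
```

Under these assumptions, the resulting accuracy for a model could be placed next to a human reference score for the same items, which is the kind of comparison the abstract describes; where such a reference comes from is left open here.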

Taxonomy

Topics: Manufacturing Process and Optimization · Law, AI, and Intellectual Property · Quality and Management Systems

Methods: Sparse Evolutionary Training