Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark
Fabio Mercorio, Mario Mezzanzanica, Daniele Potertì, Antonio Serino, Andrea Seveso

TL;DR
This paper introduces a new benchmark using the INVALSI tests to evaluate the proficiency of large language models in Italian, providing a standardized way to assess and compare their performance against human results.
Contribution
The paper adapts the INVALSI tests for automated evaluation of LLMs, offers a detailed assessment of current models, and visually compares their performance with human results.
Findings
LLMs show varying proficiency on the INVALSI benchmark
The benchmark provides a new standardized evaluation method for Italian language models
Comparison highlights gaps between LLMs and human performance
Abstract
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: Firstly, we adapt the INVALSI benchmark for automated LLM evaluation, which involves rigorous adaptation of the test format to suit automated processing while retaining the essence of the original tests. Secondly, we provide a detailed assessment of…
Taxonomy
Topics: Manufacturing Process and Optimization · Law, AI, and Intellectual Property · Quality and Management Systems
Methods: Sparse Evolutionary Training
