AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems

Athanasios Davvetas; Michael Papademas; Xenia Ziouvelou; Vangelis Karkaletsis

arXiv:2603.09435·cs.AI·March 11, 2026

TL;DR

This paper introduces an open, transparent, and reproducible dataset for evaluating NLP and RAG systems on EU AI Act compliance tasks, including risk-level classification, article retrieval, and obligation generation, with the aim of reducing the manual effort such assessments currently require.

Contribution

The authors developed a novel dataset and methodology for evaluating NLP models against EU AI Act compliance tasks, utilizing large language models for grounded scenario generation.

Findings

01

Achieved 0.87 F1-score for prohibited scenarios

02

Achieved 0.85 F1-score for high-risk scenarios

03

Demonstrated effective use of language models for grounded data generation
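
For readers unfamiliar with the metric reported above, F1 is the harmonic mean of precision and recall for a chosen positive class. The following is a minimal sketch of that calculation only; the label names and example predictions are hypothetical and are not taken from the paper's dataset, and the paper's exact evaluation protocol may differ.

```python
def f1_score(y_true, y_pred, positive):
    """Per-class F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and model predictions (illustrative only).
y_true = ["prohibited", "high-risk", "prohibited", "minimal", "high-risk", "prohibited"]
y_pred = ["prohibited", "high-risk", "high-risk", "minimal", "high-risk", "prohibited"]

f1_prohibited = f1_score(y_true, y_pred, "prohibited")
f1_high_risk = f1_score(y_true, y_pred, "high-risk")
```

In this toy example one prohibited scenario is misclassified as high-risk, which lowers recall for the prohibited class and precision for the high-risk class.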

Abstract

The rapid rollout of AI in heterogeneous public and societal sectors has escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in the regulatory landscape. The development of solutions that elicit the level of AI systems' compliance with such standards is often limited by a lack of resources, which hinders the semi-automated or automated evaluation of their performance. This creates a need for manual work, which is often error-prone, resource-limited, or limited to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models, with a strong focus on RAG systems. We have developed a dataset that contains the tasks of risk-level classification, article retrieval, obligation generation, and…

Taxonomy

Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education · Topic Modeling