AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
Athanasios Davvetas, Michael Papademas, Xenia Ziouvelou, Vangelis Karkaletsis

TL;DR
This paper introduces an open, transparent, and reproducible dataset for evaluating NLP and RAG systems on EU AI Act compliance tasks, such as classifying AI risk levels and generating the associated obligations.
Contribution
The authors developed a novel dataset and methodology for evaluating NLP models against EU AI Act compliance tasks, utilizing large language models for grounded scenario generation.
Findings
Achieved 0.87 F1-score for prohibited scenarios
Achieved 0.85 F1-score for high-risk scenarios
Demonstrated effective use of language models for grounded data generation
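For readers unfamiliar with the metric, the per-class F1 scores above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `per_class_f1`, the label strings, and the toy predictions are all assumptions made for illustration, and the numbers below are unrelated to the reported 0.87 and 0.85.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1}, treating each label one-vs-rest.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed

    scores = {}
    for label in set(gold) | set(pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy risk-level labels (illustrative only, not drawn from the dataset).
gold = ["prohibited", "prohibited", "high-risk", "high-risk", "minimal"]
pred = ["prohibited", "high-risk", "high-risk", "high-risk", "minimal"]

for label, score in sorted(per_class_f1(gold, pred).items()):
    print(f"{label}: {score:.2f}")
```

Reporting F1 per risk level, rather than a single accuracy figure, keeps a rare class such as prohibited scenarios from being masked by more frequent ones.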
Abstract
The rapid rollout of AI in heterogeneous public and societal sectors has escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in the regulatory landscape. The development of solutions that assess the level of AI systems' compliance with such standards is often limited by the lack of resources, hindering the semi-automated or automated evaluation of their performance. This generates the need for manual work, which is often error-prone, resource-limited or limited to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models with a strong focus on RAG systems. We have developed a dataset that contains the tasks of risk-level classification, article retrieval, obligation generation, and…
Taxonomy
Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education · Topic Modeling
