Methods to Assess the UK Government's Current Role as a Data Provider for AI
Neil Majithia, Elena Simperl

TL;DR
This paper introduces two methods to evaluate the UK government's data contribution to AI training, revealing that government websites are significant data sources for LLMs, while open datasets are less influential.
Contribution
It presents two novel methods for assessing how government data is used in AI training, addressing the challenge of opaque training corpora and offering a framework that other organizations can reuse.
Findings
UK government websites are important data sources for AI.
Data.gov.uk is not a significant data source for AI.
Proposed methods are reproducible and applicable beyond the UK.
Abstract
Governments typically collect and steward a vast amount of high-quality data on their citizens and institutions, and the UK government is exploring how it can better publish and provision this data to the benefit of the AI landscape. However, the compositions of generative AI training corpora remain closely guarded secrets, making the planning of data sharing initiatives difficult. To address this, we devise two methods to assess UK government data usage for the training of Large Language Models (LLMs) and 'peek behind the curtain' in order to observe the UK government's current contributions as a data provider for AI. The first method, an ablation study that utilises LLM 'unlearning', seeks to examine the importance of the information held on UK government websites for LLMs and their performance in citizen query tasks. The second method, an information leakage study, seeks to ascertain…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
Methods: Sparse Evolutionary Training · Focus · Attentive Walk-Aggregating Graph Neural Network
