Characterizing Datasets for LLM-based Requirements Engineering: A Systematic Mapping Study
Quim Motger, Carlota Catot, Xavier Franch

TL;DR
This systematic mapping study characterizes 62 datasets used in LLM-based Requirements Engineering, highlighting gaps and providing a structured scheme to improve dataset reuse, comparison, and research foundations.
Contribution
It introduces a comprehensive characterization scheme for datasets in LLM-based RE and analyzes their usage, revealing imbalances and areas needing more diversity.
Findings
62 publicly available datasets, referenced by 45 primary studies, characterized across multiple dimensions.
Limited support for elicitation activities and language diversity.
Notable imbalances and gaps in dataset usage and diversity.
Abstract
Large Language Models (LLMs) depend on high-quality, domain-specific natural language datasets. This dependency is particularly pronounced in Requirements Engineering (RE), where core activities rely on textual artifacts such as requirements, specifications, and stakeholder feedback. Despite the increasing use of LLMs in RE, data scarcity remains a widely reported limitation. While several datasets support LLM-based RE research, they are scattered across studies and lack systematic characterization, hindering reuse, comparability and assessment. This paper addresses this gap by examining which public datasets are used in LLM-based RE, how they can be consistently characterized, and which RE tasks and dataset properties remain under-represented. We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a…
