From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?
Shadman Sakib, Oishy Fatema Akhand, Tasnia Tasneem, Shohel Ahmed

TL;DR
This paper investigates whether large language models can convert messy app store reviews into structured, actionable user stories for agile development. The generated stories are fluent and well formatted, but they fall short on independence and diversity.
Contribution
The paper offers the first comprehensive evaluation of LLMs for transforming raw app reviews into backlog-ready user stories, comparing zero-shot, one-shot, and two-shot prompting.
Findings
LLMs can generate fluent, well-formatted user stories that match or exceed the quality of human-written ones.
Few-shot prompting improves the quality of generated user stories.
LLMs struggle with producing independent and diverse user stories for a comprehensive backlog.
Abstract
App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even…
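The zero-shot, one-shot, and two-shot settings named in the abstract differ only in how many worked examples precede the target review in the prompt. A minimal sketch of that idea follows. The instruction wording, the demonstration pairs, and the function name `build_prompt` are all assumptions made for illustration; they are not the authors' prompts and are not drawn from Mini-BAR.

```python
from typing import List, Tuple

# Hypothetical instruction; the paper's actual prompt wording is not given here.
INSTRUCTION = (
    "Rewrite the app review below as one user story in the form: "
    "'As a <role>, I want <goal>, so that <benefit>.'"
)

# Hypothetical (review, user story) demonstration pairs, invented for this sketch.
EXAMPLES: List[Tuple[str, str]] = [
    (
        "app keeps logging me out every day!!",
        "As a user, I want to stay signed in, "
        "so that I do not have to log in every day.",
    ),
    (
        "cant export my step data anywhere",
        "As a user, I want to export my step data, "
        "so that I can use it in other tools.",
    ),
]


def build_prompt(review: str, n_shots: int = 0) -> str:
    """Build a prompt with n_shots demonstrations (0 = zero-shot, 1, or 2)."""
    if not 0 <= n_shots <= len(EXAMPLES):
        raise ValueError(f"n_shots must be between 0 and {len(EXAMPLES)}")
    parts = [INSTRUCTION]
    # Each demonstration shows a raw review followed by its user story.
    for ex_review, ex_story in EXAMPLES[:n_shots]:
        parts.append(f"Review: {ex_review}\nUser story: {ex_story}")
    # The target review ends with an open slot for the model to complete.
    parts.append(f"Review: {review}\nUser story:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("wish it synced with my watch", n_shots=2))
```

The resulting string would be sent to whichever model is under test; the model call itself is left out because it depends on the provider.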
