From Online User Feedback to Requirements: Evaluating Large Language Models for Classification and Specification Tasks

Manjeshwar Aniruddh Mallya (1), Alessio Ferrari (2), Mohammad Amin Zadenoori (3), Jacek Dąbrowski (1) ((1) Lero, the Research Ireland Centre for Software, University of Limerick, Ireland; (2) University College Dublin (UCD), Ireland; (3) University of Padova, Italy)

arXiv:2510.23055·cs.SE·October 28, 2025

TL;DR

This study evaluates lightweight large language models for analyzing online user feedback to support requirements engineering, demonstrating moderate success in classification and specification tasks, and providing insights into their capabilities and limitations.

Contribution

The paper offers the first empirical evaluation of lightweight LLMs on RE tasks, along with a replication package and an analysis of their effectiveness and constraints.

Findings

01

LLMs achieved moderate-to-high classification performance (F1 ~ 0.47-0.68)

02

Specification quality was moderately high (mean ~ 3/5)

03

Provides insights into LLMs' capabilities and limitations for RE tasks
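
For readers less familiar with the metric behind finding 01, the following is a minimal sketch of how a per-class F1 score is computed from precision and recall. The labels, predictions, and the `f1_score` helper are made up for illustration; they are not taken from the paper's datasets or replication package, and the averaging scheme the authors used is not stated on this page.

```python
def f1_score(gold, predicted, positive):
    """Return the F1 score for one class, treated as the positive label."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    if true_pos == 0:
        # No correct positive predictions: precision or recall is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical feedback labels (assumed for this example only).
gold = ["feature", "bug", "other", "feature", "bug", "feature"]
predicted = ["feature", "bug", "feature", "other", "bug", "feature"]

score = f1_score(gold, predicted, "feature")
print(f"F1 for 'feature': {score:.2f}")
```

In this toy data the "feature" class has two correct predictions, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1. That value happens to sit near the top of the reported 0.47-0.68 range.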

Abstract

[Context and Motivation] Online user feedback provides valuable information to support requirements engineering (RE). However, analyzing online user feedback is challenging due to its large volume and noise. Large language models (LLMs) show strong potential to automate this process and outperform previous techniques. They can also enable new tasks, such as generating requirements specifications. [Question-Problem] Despite their potential, the use of LLMs to analyze user feedback for RE remains underexplored. Existing studies offer limited empirical evidence, lack thorough evaluation, and rarely provide replication packages, undermining validity and reproducibility. [Principal Idea-Results] We evaluate five lightweight open-source LLMs on three RE tasks: user request classification, NFR classification, and requirements specification generation. Classification performance was…
