Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research

Cristina Martinez Montes; Robert Feldt; Cristina Miguel Martos; Sofia Ouhbi; Shweta Premanandan; Daniel Graziotin

arXiv:2510.18456 · cs.SE · October 22, 2025

Open Access

TL;DR

This study develops a reproducible method for integrating large language models into thematic analysis in software engineering research, evaluating their outputs against expert criteria and providing practical guidelines.

Contribution

It introduces a systematic prompt engineering and evaluation framework for using LLMs in qualitative thematic analysis, filling a gap in reproducibility and methodological guidance.

Findings

01. In blind evaluations, expert evaluators preferred LLM-generated codes 61% of the time.

02. LLMs can assist with coding but often fragment data and miss latent meanings.

03. The guidelines clarify where LLMs can be used effectively and where human interpretation remains necessary.
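
A preference rate such as the 61% reported above is a simple proportion over blind pairwise judgments. The sketch below shows one way such a share could be computed; the function name `preference_share`, the labels `"llm"` and `"human"`, and the toy judgments are all hypothetical illustrations, not the study's data or procedure.

```python
from collections import Counter


def preference_share(judgments, label):
    """Return the fraction of judgments that favored `label`."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    counts = Counter(judgments)
    return counts[label] / len(judgments)


# Hypothetical blind judgments (NOT the study's data): each entry records
# which source's codes an evaluator preferred for one item.
toy_judgments = ["llm", "human", "llm", "llm", "human"]

share = preference_share(toy_judgments, "llm")
print(f"LLM codes preferred in {share:.0%} of toy judgments")
```

A single proportion like this says nothing about why evaluators preferred one source, which is why the rubric-based assessment described in the abstract matters alongside it.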

Abstract

As artificial intelligence advances, large language models (LLMs) are entering qualitative research workflows, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA), one of the most common qualitative methods in software engineering research. Moreover, existing studies lack systematic evaluation of LLM-generated qualitative outputs against established quality criteria. We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA, then tested outputs from multiple LLMs against codes and themes produced by experienced researchers. Using 15 interviews on software engineers' well-being, we conducted blind evaluations with four expert evaluators who applied rubrics derived directly from Braun and Clarke's quality criteria. Evaluators preferred LLM-generated codes 61% of the time, finding them…

Taxonomy

Topics: Computational and Text Analysis Methods · Software Engineering Techniques and Practices · Ethics and Social Impacts of AI