Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Moaath Alshaikh; Tasneem Alshaher; Ricardo Vieira; Beatriz Santana; Clelio Xavier; Jose Amancio; Glauco Carneiro; Julio Leite; Savio Freire; and Manoel Mendonca

arXiv:2605.07422·cs.SE·May 11, 2026

TL;DR

This study empirically evaluates how zero-shot and multi-shot prompting affect the reliability of three LLMs in qualitative coding of psychological safety in software engineering communities, and reports on run-to-run stability and a shared over-prediction bias.

Contribution

It provides a controlled comparison of three LLMs under two prompting strategies (zero-shot and multi-shot closed coding), offering empirical guidelines for prompt engineering in qualitative analysis tasks.

Findings

01 Multi-shot prompting improves agreement for Claude Haiku.

02 DeepSeek-Chat and Claude Haiku show low variance across runs, indicating stable output.

03 All models tend to over-predict 'Sharing Negative Feedback'.

Abstract

Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves…
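The agreement metric named in the abstract, Cohen's kappa, compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration and not the authors' code. It assumes each item receives exactly one category label, and the function names, the label "Other", and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(reference, predicted):
    """Cohen's kappa between two equal-length sequences of category labels."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(reference)
    # Observed agreement: share of items given the same label by both coders.
    p_o = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1.0:
        # Both coders used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is an assumption of this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def summarize_runs(reference, runs):
    """Mean and sample standard deviation of kappa over repeated runs."""
    kappas = [cohens_kappa(reference, run) for run in runs]
    spread = stdev(kappas) if len(kappas) > 1 else 0.0
    return mean(kappas), spread


# Toy data (hypothetical): one human coding and two model runs.
human = ["Sharing Negative Feedback", "Sharing Negative Feedback", "Other", "Other"]
model_runs = [
    ["Sharing Negative Feedback", "Sharing Negative Feedback", "Other", "Sharing Negative Feedback"],
    ["Sharing Negative Feedback", "Sharing Negative Feedback", "Other", "Other"],
]
mean_kappa, kappa_sd = summarize_runs(human, model_runs)
print(f"mean kappa = {mean_kappa:.2f}, sd = {kappa_sd:.2f}")
```

Reporting the spread alongside the mean is one simple way to express the kind of run-to-run stability the findings refer to; the study's own stability measure may differ.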
