LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments
Maria Camporese, Fabio Massacci, Yuanjun Gong

TL;DR
This study evaluates the effectiveness of large language models in automating the annotation of security-specific comments in human experiments, finding limited success in replacing human annotators.
Contribution
It provides an empirical assessment of LLMs for security comment annotation, highlighting their current limitations and the need for further research.
Findings
LLMs show limited reliability in security comment annotation.
Detailed code descriptions improve LLM performance somewhat.
Current LLMs cannot reliably replace human annotators for this task.
Abstract
[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines-of-code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators for technical security comments by human subjects. [Method:] We prompt four top-performing LLMs on LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotators using Cohen's Kappa (chance-corrected accuracy). We test different prompts…
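The agreement metric named in the abstract, Cohen's Kappa, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `cohen_kappa` and the example labels are hypothetical, and the sketch assumes each comment receives one binary label per code (1 = code present, 0 = absent) from a human annotator and from an LLM.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa between two annotators' labels for the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labels independently according to their own frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label for every item;
        # kappa is undefined, so perfect agreement is reported by convention.
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for one code (e.g. "security keyword mentioned")
# over eight comments; illustrative only, not data from the paper.
human = [1, 1, 0, 0, 1, 0, 0, 0]
llm = [1, 0, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(human, llm)
```

In this made-up example the two annotators agree on 6 of 8 comments (raw agreement 0.75), but because both label most comments 0, chance alone would produce agreement of about 0.53, so Kappa is noticeably lower than the raw agreement. That gap is why a chance-corrected measure is used rather than plain accuracy.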
