Large Language Models as Robust Data Generators in Software Analytics: Are We There Yet?

Md. Abdul Awal; Mrigank Rochan; Chanchal K. Roy

arXiv:2411.10565 · cs.SE · May 7, 2025

Open Access

TL;DR

This study compares models fine-tuned on LLM-generated data with models fine-tuned on human-written data in software analytics. LLM-generated data yields competitive task performance, but the resulting models are less robust to adversarial attacks.

Contribution

The paper systematically evaluates the robustness of pre-trained models (PTMs) fine-tuned on LLM-generated data against adversarial attacks across multiple software analytics tasks.

Findings

01. PTMs fine-tuned on LLM-generated data perform comparably to those fine-tuned on human-written data.

02. Models fine-tuned on LLM-generated data are less robust to adversarial attacks.

03. Further research is needed to improve the quality of LLM-generated data for robustness.

Abstract

Large Language Model (LLM)-generated data is increasingly used in software analytics, but it is unclear how this data compares to human-written data, particularly when models are exposed to adversarial scenarios. Adversarial attacks can compromise the reliability and security of software systems, so understanding how LLM-generated data performs under these conditions, compared to human-written data, which serves as the benchmark for model performance, can provide valuable insights into whether LLM-generated data offers similar robustness and effectiveness. To address this gap, we systematically evaluate and compare the quality of human-written and LLM-generated data for fine-tuning robust pre-trained models (PTMs) in the context of adversarial attacks. We evaluate the robustness of six widely used PTMs, fine-tuned on human-written and LLM-generated data, before and after adversarial…

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Web Application Security Vulnerabilities