A Survey on Data Security in Large Language Models

Kang Chen; Xiuze Zhou; Yuanguo Lin; Jinhe Su; Yuanhui Yu; Li Shen; Fan Lin

arXiv:2508.02312·cs.CR·August 5, 2025

Open Access

TL;DR

This survey reviews data security risks in Large Language Models, discusses current defense strategies, analyzes datasets used for robustness evaluation, and outlines future research directions for safer LLM deployment.

Contribution

It provides a comprehensive overview of data security challenges in LLMs, categorizes defense methods, and offers guidance for future research and policy development.

Findings

01. Identification of key data security risks in LLMs

02. Analysis of current defense strategies like adversarial training and RLHF

03. Guidance on datasets for robustness evaluation

Abstract

Large Language Models (LLMs), now foundational to natural language processing, power applications such as text generation, machine translation, and conversational systems. Despite their transformative potential, these models rely on massive amounts of training data, often collected from diverse and uncurated sources, which exposes them to serious data security risks. Harmful or malicious data can compromise model behavior, leading to toxic output, hallucinations, and vulnerability to threats such as prompt injection and data poisoning. As LLMs continue to be integrated into critical real-world systems, understanding and addressing these data-centric security risks is imperative to safeguard user trust and system reliability. This survey offers a comprehensive overview of the main data security risks facing LLMs and reviews current defense strategies,…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Big Data and Digital Economy · Privacy-Preserving Technologies in Data