The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data

Kaituo Zhang; Mingzhi Hu; Hoang Anh Duy Le; Fariha Kabir Torsha; Zhimeng Jiang; Minh Khai Bui; Chia-Yuan Chang; Yu-Neng Chuang; Zhen Xiong; Ying Lin; Guanchu Wang; Na Zou

arXiv:2601.17717·cs.AI·January 28, 2026

TL;DR

This paper introduces the LLM Data Auditor framework, a unified, metric-oriented approach for evaluating the intrinsic quality and trustworthiness of LLM-generated synthetic data across multiple modalities, and it highlights gaps in current evaluation practice.

Contribution

The paper proposes an evaluation framework that categorizes intrinsic metrics for synthetic data, analyzes existing methods against it, and offers recommendations for improving data quality assessment across modalities.

Findings

01. Current evaluation practices have significant deficiencies.

02. Intrinsic metrics can effectively assess data quality and trustworthiness.

03. The framework guides practical application of synthetic data across modalities.

Abstract

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the LLM Data Auditor framework. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize…

Taxonomy

Topics: Data Quality and Management · Machine Learning and Data Classification · Business Process Modeling and Analysis