Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Ziyang Miao; Qiyu Sun; Jingyuan Wang; Yuchen Gong; Yaowei Zheng; Shiqi Li; Richong Zhang

arXiv:2507.04009·cs.CL·July 8, 2025

TL;DR

Easy Dataset provides an accessible, GUI-based framework for synthesizing high-quality, domain-specific fine-tuning data from unstructured documents, enhancing LLM adaptation with human-in-the-loop review.

Contribution

It introduces a unified, user-friendly system for transforming raw documents into training data, combining configurable extraction, persona-driven prompting, and human review.

Findings

01 Improves domain-specific LLM performance on financial QA tasks.

02 Enables efficient data synthesis with minimal technical expertise.

03 Achieves high user engagement and data quality through visual interfaces.

Abstract

Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset lets users configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then uses a persona-driven prompting approach to generate diverse question-answer pairs with publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of…
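To make the chunk-then-prompt flow described in the abstract concrete, here is a minimal Python sketch. It is not Easy Dataset's code or API: the `Persona` class, the `chunk_text` and `build_prompts` functions, the paragraph-packing strategy, and the prompt wording are all assumptions chosen for illustration, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical persona record; the fields are illustrative assumptions.
    name: str
    description: str


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks aim to stay within max_chars; a single paragraph longer than
    max_chars is kept whole as its own chunk rather than being split.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompts(chunks: list[str], personas: list[Persona]) -> list[str]:
    """Build one question-answer generation prompt per (chunk, persona) pair."""
    prompts = []
    for chunk in chunks:
        for persona in personas:
            prompts.append(
                f"You are {persona.name}: {persona.description}\n"
                "Write one question this persona would ask about the text "
                "below, then answer it using only that text.\n\n"
                f"{chunk}"
            )
    return prompts
```

Pairing every chunk with every persona is one simple way to get varied questions from the same passage; each resulting prompt would then be sent to an LLM, and the generated pairs passed to a human reviewer.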
