Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, Xia Hu

TL;DR
This paper explores using synthetic data generated by ChatGPT to improve clinical text mining, addressing privacy issues and enhancing model performance in extracting biological entities and relations from healthcare texts.
Contribution
The study introduces a novel training paradigm that leverages ChatGPT to generate labeled synthetic data for fine-tuning local models in clinical text mining tasks.
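The data-preparation half of this paradigm can be sketched as follows: ask the LLM for sentences with entities marked inline, then convert each one into token-level BIO labels that a local tagger can be fine-tuned on. This is a minimal illustration under stated assumptions, not the authors' pipeline. The prompt wording, the `<Type>...</Type>` markup, and the helper names `build_prompt` and `parse_tagged` are all hypothetical, and the example sentence is hand-written rather than real model output.

```python
from __future__ import annotations

import re


def build_prompt(entity_type: str, n_sentences: int) -> str:
    # Hypothetical prompt template; the paper's actual prompts are not reproduced here.
    return (
        f"Write {n_sentences} sentences in the style of biomedical abstracts. "
        f"Wrap every {entity_type} mention as <{entity_type}>...</{entity_type}>."
    )


# Matches <Label>entity text</Label>; this markup format is an assumption.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_tagged(sentence: str) -> list[tuple[str, str]]:
    """Convert one inline-tagged sentence into (token, BIO-label) pairs."""
    pairs: list[tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(sentence):
        # Text before the entity is outside any mention.
        for tok in sentence[pos:match.start()].split():
            pairs.append((tok, "O"))
        label = match.group(1)
        # First token of the mention gets B-, the rest get I-.
        for i, tok in enumerate(match.group(2).split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = match.end()
    # Remaining text after the last entity.
    for tok in sentence[pos:].split():
        pairs.append((tok, "O"))
    return pairs


# Hand-written stand-in for a model response.
example = "Patients with <Disease>lung cancer</Disease> received cisplatin."
pairs = parse_tagged(example)
```

Whitespace tokenization keeps punctuation attached to tokens; a real pipeline would use the tokenizer of whichever local model is being fine-tuned.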
Findings
Fine-tuning a local model on the synthetic data raised the F1-score for named entity recognition from 23.37% to 63.99%.
The same approach raised the F1-score for relation extraction from 75.86% to 83.59%.
Synthetic data generation reduces data collection time and privacy concerns.
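For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it over sets of predicted and gold entity spans, assuming exact-match scoring on `(start, end, type)` tuples. That scoring convention and the toy spans are assumptions for illustration; the paper's exact evaluation protocol is not described in this summary.

```python
def f1_score(gold: set, pred: set) -> float:
    """Exact-match F1 between gold and predicted entity spans."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # correct predictions / all predictions
    recall = true_positives / len(gold)     # correct predictions / all gold spans
    return 2 * precision * recall / (precision + recall)


# Toy spans as (start_token, end_token, entity_type); invented for illustration.
gold = {(0, 2, "Disease"), (5, 6, "Chemical"), (9, 10, "Disease")}
pred = {(0, 2, "Disease"), (5, 6, "Disease"), (9, 10, "Disease"), (12, 13, "Chemical")}

score = f1_score(gold, pred)
```

Here two of the four predictions match gold spans exactly, so precision is 2/4, recall is 2/3, and the F1-score is 4/7. The span with the right boundaries but the wrong type counts as both a false positive and a false negative.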
Abstract
Recent advancements in large language models (LLMs) have led to the development of highly potent models like OpenAI's ChatGPT. These models have exhibited exceptional performance in a variety of tasks, such as question answering, essay composition, and code generation. However, their effectiveness in the healthcare sector remains uncertain. In this study, we seek to investigate the potential of ChatGPT to aid in clinical text mining by examining its ability to extract structured information from unstructured healthcare texts, with a focus on biological named entity recognition and relation extraction. However, our preliminary results indicate that employing ChatGPT directly for these tasks resulted in poor performance and raised privacy concerns associated with uploading patients' information to the ChatGPT API. To overcome these limitations, we propose a new training paradigm that…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling
