SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task
Ziije Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin, Xiaofan Zhang

TL;DR
This paper introduces SyntheT2C, a method for creating synthetic datasets to improve large language models' ability to translate natural language into Cypher queries for knowledge graph databases, especially in the medical domain.
Contribution
The paper presents a novel synthetic data generation approach for the Text2Cypher task, enabling better fine-tuning of LLMs without requiring extensive manual annotation.
Findings
Synthetic MedT2C dataset improves LLM performance on Text2Cypher
SyntheT2C combines prompting and template-filling pipelines
Enhanced LLMs show better accuracy in generating Cypher queries
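To make the template-filling idea concrete, here is a minimal sketch of how Query-Cypher pairs might be produced by filling paired question and Cypher templates with schema values. It is not the paper's pipeline: the node labels (`Disease`, `Drug`), the `TREATS` relationship, the entity names, the templates, and the `fill_templates` helper are all hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical schema values; NOT taken from the paper or the MedT2C dataset.
SCHEMA = {
    "Disease": ["asthma", "diabetes"],
    "Drug": ["metformin"],
}

# Each template pairs a question pattern with a Cypher pattern that uses the
# same slots. Doubled braces are literal braces in the Cypher property map.
TEMPLATES = [
    {
        "slots": {"disease": "Disease"},
        "question": "Which drugs treat {disease}?",
        "cypher": (
            "MATCH (d:Drug)-[:TREATS]->(x:Disease {{name: '{disease}'}}) "
            "RETURN d.name"
        ),
    },
    {
        "slots": {"drug": "Drug", "disease": "Disease"},
        "question": "Is {drug} used to treat {disease}?",
        "cypher": (
            "MATCH (d:Drug {{name: '{drug}'}})-[:TREATS]->"
            "(x:Disease {{name: '{disease}'}}) RETURN count(*) > 0 AS treats"
        ),
    },
]


def fill_templates(templates, schema):
    """Return one question/Cypher pair per combination of slot values."""
    pairs = []
    for template in templates:
        slot_names = list(template["slots"])
        # Look up the candidate values for each slot by its node label.
        value_lists = [schema[template["slots"][name]] for name in slot_names]
        for combo in product(*value_lists):
            binding = dict(zip(slot_names, combo))
            pairs.append(
                {
                    "question": template["question"].format(**binding),
                    "cypher": template["cypher"].format(**binding),
                }
            )
    return pairs


PAIRS = fill_templates(TEMPLATES, SCHEMA)
for pair in PAIRS:
    print(pair["question"], "->", pair["cypher"])
```

Because both halves of a pair are filled from the same binding, the question and the query stay aligned by construction. A real pipeline would presumably also validate each generated query against the target database before using it for fine-tuning.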
Abstract
Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs' efficacy and mitigating their "hallucinations". Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), it is critical to connect LLMs with KG databases by automating the translation of natural language into Cypher queries (termed as "Text2Cypher" task). Prior efforts tried to bolster LLMs' proficiency in Cypher generation through Supervised Fine-Tuning (SFT). However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of such annotation. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications
Methods: Shrink and Fine-Tune
