Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models
Hyung-Kwon Ko, Hyeon Jeon, Gwanmo Park, Dae Hyun Kim, Nam Wook Kim, Juho Kim, Jinwook Seo

TL;DR
VL2NL is a framework that uses large language models to generate diverse natural language datasets from Vega-Lite specifications, improving the development of natural language interfaces for data visualization.
Contribution
The paper introduces VL2NL, a novel LLM-based framework that synthesizes rich NL datasets from Vega-Lite specs with guided discovery and paraphrasing, and provides a new diverse chart collection.
Findings
Achieved 89.4% accuracy in extracting chart semantics.
Generated L1/L2 NL captions with 76.0% accuracy.
Produced more diverse utterances and questions than benchmarks.
Abstract
We introduce VL2NL, a Large Language Model (LLM) framework that generates rich and diverse NL datasets using only Vega-Lite specifications as input, thereby streamlining the development of Natural Language Interfaces (NLIs) for data visualization. To synthesize relevant chart semantics accurately and enhance syntactic diversity in each NL dataset, we leverage 1) guided discovery incorporated into prompting, so that LLMs can steer themselves to create faithful NL datasets in a self-directed manner; and 2) score-based paraphrasing to augment NL syntax along four language axes. We also present a new collection of 1,981 real-world Vega-Lite specifications that have greater diversity and complexity than existing chart collections. When tested on our chart collection, VL2NL extracted chart semantics and generated L1/L2 captions with 89.4% and 76.0% accuracy, respectively. It also…
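To make "chart semantics" concrete, the sketch below reads the mark type and encoded fields out of a single-view Vega-Lite specification. It is only an illustration under stated assumptions: it is a rule-based stand-in, not the LLM prompting that VL2NL uses, and the function name `summarize_spec` and the example spec are hypothetical.

```python
from typing import Any


def summarize_spec(spec: dict[str, Any]) -> dict[str, Any]:
    """Collect the mark type and encoded fields from a single-view Vega-Lite spec.

    Hypothetical helper for illustration; not part of VL2NL.
    """
    mark = spec.get("mark")
    # Vega-Lite accepts either "mark": "bar" or "mark": {"type": "bar", ...}.
    mark_type = mark.get("type") if isinstance(mark, dict) else mark

    channels = []
    for channel, enc in spec.get("encoding", {}).items():
        if not isinstance(enc, dict):
            continue  # skip anything that is not a channel definition object
        channels.append({
            "channel": channel,
            "field": enc.get("field"),
            "type": enc.get("type"),
            "aggregate": enc.get("aggregate"),
        })
    return {"mark": mark_type, "channels": channels}


# Made-up example spec: mean horsepower per origin, drawn as bars.
example_spec = {
    "mark": {"type": "bar"},
    "encoding": {
        "x": {"field": "origin", "type": "nominal"},
        "y": {"field": "horsepower", "type": "quantitative", "aggregate": "mean"},
    },
}
summary = summarize_spec(example_spec)
```

A summary of this kind covers only the fields a spec states explicitly. Layered or concatenated views would need extra handling, which this sketch leaves out.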
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
