Semantically Aligned Question and Code Generation for Automated Insight Generation
Ananya Singha, Bhavya Chopra, Anirudh Khatry, Sumit Gulwani, Austin Z. Henley, Vu Le, Chris Parnin, Mukul Singh, Gust Verbruggen

TL;DR
This paper presents a method using large language models to generate semantically aligned questions and code for automated insight generation, improving the relevance and diversity of insights for data analysis.
Contribution
It introduces a semantic filtering approach using embeddings to ensure question-code alignment and demonstrates that joint question and code generation enhances diversity.
Findings
Embedding-based filtering effectively removes unaligned question-code pairs
Joint question and code generation increases diversity of insights
Empirical results on Open-WikiTable data validate the approach
Abstract
Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, when large language models produce automated insights, the generated code does not always correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to generate targeted and insightful questions about data and the corresponding code to answer those questions. Then, through an empirical study on data from Open-WikiTable, we show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. Additionally, we found that generating questions and code together yields more diverse questions.
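The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `toy_embed` function (plain token counts), the `filter_aligned` helper, the 0.3 threshold, and the example pairs are all assumptions made for illustration. A real setup would presumably replace `toy_embed` with calls to an embedding model and tune the threshold on labeled data.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: lowercase token counts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(tokens))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_aligned(
    pairs: List[Tuple[str, str]],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.3,  # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep only (question, code) pairs whose embeddings are similar enough."""
    return [
        (question, code)
        for question, code in pairs
        if cosine(embed(question), embed(code)) >= threshold
    ]


# Made-up example pairs: the first is aligned, the second is not.
pairs = [
    ("What is the average population per country?",
     "df.groupby('country')['population'].mean()"),
    ("Which year had the highest revenue?",
     "df['name'].str.len().max()"),
]
aligned = filter_aligned(pairs)
```

With the toy embedding, the first pair shares the tokens `country` and `population` and passes the threshold, while the second pair shares no tokens and is dropped. Token overlap is only a crude proxy; the point of using learned embeddings is to capture alignment even when the question and code share no surface vocabulary.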
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Educational Technology and Assessment
