Contextualized Data-Wrangling Code Generation in Computational Notebooks

Junjie Huang; Daya Guo; Chenglong Wang; Jiazhen Gu; Shuai Lu; Jeevana Priya Inala; Cong Yan; Jianfeng Gao; Nan Duan; Michael R. Lyu

arXiv:2409.13551·cs.SE·September 23, 2024

TL;DR

This paper introduces CoCoMine, a method to mine high-quality, context-rich data-wrangling examples from notebooks, and creates CoCoNote, a large dataset to improve code generation models for data wrangling tasks.

Contribution

The paper presents CoCoMine, which extracts contextualized data-wrangling examples from notebooks, and uses it to construct the CoCoNote dataset for training and evaluating data-wrangling code generation.

Findings

01. Incorporating data context improves code generation accuracy.

02. Fine-tuning models on CoCoNote enhances performance.

03. DataCoder encoding method outperforms baselines.

Abstract

Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process and reduce analysts' overhead by translating user intents into executable code. Precisely generating data wrangling code requires comprehensive consideration of the rich context present in notebooks, including textual context, code context, and data context. However, notebooks often interleave multiple non-linear analysis tasks into a linear sequence of code blocks, where the contextual dependencies are not clearly reflected. Directly training models with source code blocks fails to fully exploit the contexts for accurate wrangling code generation. To bridge the gap, we aim to construct a high-quality dataset with clear and rich contexts to help training…
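To make the three context types named in the abstract concrete, here is a minimal sketch of how textual, code, and data context might be assembled into a single model input. It is not the paper's CoCoMine pipeline or its DataCoder encoding; the class `NotebookContext`, the functions `render_data_context` and `build_prompt`, the `[text]`/`[code]`/`[data]`/`[intent]` tags, and the sample rows are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NotebookContext:
    """Hypothetical container for the three context types in the abstract."""
    text_cells: List[str] = field(default_factory=list)  # markdown near the target cell
    code_cells: List[str] = field(default_factory=list)  # earlier code the target depends on
    data_rows: List[Dict] = field(default_factory=list)  # sample rows of the input table


def render_data_context(rows: List[Dict], max_rows: int = 3) -> str:
    """Render a header plus the first few sample rows, pipe-separated."""
    if not rows:
        return "(no data available)"
    columns = list(rows[0].keys())
    lines = [" | ".join(columns)]
    for row in rows[:max_rows]:
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)


def build_prompt(ctx: NotebookContext, intent: str) -> str:
    """Concatenate textual, code, and data context, then the user intent."""
    parts = []
    for text in ctx.text_cells:
        parts.append("# [text] " + text)
    for code in ctx.code_cells:
        parts.append("# [code]\n" + code)
    parts.append("# [data]\n" + render_data_context(ctx.data_rows))
    parts.append("# [intent] " + intent)
    return "\n\n".join(parts)


# Illustrative values only; none of this comes from CoCoNote.
ctx = NotebookContext(
    text_cells=["Clean the passenger table"],
    code_cells=["df = pd.read_csv('passengers.csv')"],
    data_rows=[{"name": "Ann", "age": 31}, {"name": "Bo", "age": None}],
)
prompt = build_prompt(ctx, "drop rows where age is missing")
print(prompt)
```

The data preview is the part a code-only prompt lacks: column names and sample values (including a missing `age`) are the kind of information the findings above credit with improving generation accuracy.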

Code & Models

Repositories

Jun-jie-Huang/CoCoNote
Official
