TL;DR
This paper introduces CoCoMine, a method to mine high-quality, context-rich data-wrangling examples from notebooks, and creates CoCoNote, a large dataset to improve code generation models for data wrangling tasks.
Contribution
It presents CoCoMine, a method for extracting contextualized data-wrangling examples from notebooks, and uses it to construct the CoCoNote dataset, advancing data-wrangling code generation in notebooks.
Findings
Incorporating data context improves code generation accuracy.
Fine-tuning models on CoCoNote enhances performance.
The DataCoder encoding method outperforms baselines.
Abstract
Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process and reduce analysts' overhead by translating user intents into executable code. Precisely generating data wrangling code necessitates a comprehensive consideration of the rich context present in notebooks, including textual context, code context, and data context. However, notebooks often interleave multiple non-linear analysis tasks into a linear sequence of code blocks, where the contextual dependencies are not clearly reflected. Directly training models with source code blocks fails to fully exploit the contexts for accurate wrangling code generation. To bridge the gap, we aim to construct a high-quality dataset with clear and rich contexts to help training…
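To make the three kinds of context named in the abstract concrete, here is a minimal sketch of how a single contextualized example might be represented and flattened into a model prompt. The class `WranglingExample`, its field names, the helper functions, and the prompt layout are all hypothetical and chosen for illustration; they are not the CoCoMine pipeline or the CoCoNote schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WranglingExample:
    """Hypothetical container; field names are illustrative, not from the paper."""
    text_context: str               # e.g. a markdown cell stating the intent
    code_context: list[str]         # earlier code cells the target depends on
    data_context: dict[str, list]   # column name -> a few sample values
    target_code: str                # the wrangling code a model should produce


def format_data_context(columns: dict[str, list], max_rows: int = 2) -> str:
    """Render a small plain-text preview of the table (header plus sample rows)."""
    header = " | ".join(columns)
    n_rows = min(max_rows, min((len(v) for v in columns.values()), default=0))
    rows = [" | ".join(str(columns[c][i]) for c in columns) for i in range(n_rows)]
    return "\n".join([header, *rows])


def build_prompt(example: WranglingExample) -> str:
    """Concatenate textual, code, and data context into one prompt string."""
    parts = [
        "# Intent", example.text_context,
        "# Preceding code", "\n".join(example.code_context),
        "# Data preview", format_data_context(example.data_context),
    ]
    return "\n".join(parts)


example = WranglingExample(
    text_context="Drop rows with a missing price.",
    code_context=["import pandas as pd", "df = pd.read_csv('sales.csv')"],
    data_context={"item": ["pen", "ink", "pad"], "price": [1.5, None, 3.0]},
    target_code="df = df.dropna(subset=['price'])",
)
prompt = build_prompt(example)
print(prompt)
```

The point of the sketch is only that the data preview travels alongside the intent and the prior code, so a model can see that `price` contains missing values before generating the target line.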
