DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production

Xiaoyun Liang; Jingyi Ren; Jiayi Qi; Chao Peng; Bo Jiang

arXiv:2412.08069·cs.SE·December 12, 2024

TL;DR

DialogAgent is an automated tool that generates realistic synthetic training data for code question answering. It makes data production 4.8 times more efficient than traditional methods and improves model performance on code QA tasks.

Contribution

We introduce DialogAgent, a novel system for producing high-quality synthetic developer interaction data to enhance code-related language models.

Findings

01. Increased data generation efficiency by 4.8 times
02. 33% improvement in response acceptance rate
03. Enhanced model performance on code QA tasks

Abstract

Large Language Models (LLMs) have become increasingly integral to enhancing developer productivity, particularly in code generation, comprehension, and repair tasks. However, fine-tuning these models with high-quality, real-world data is challenging due to privacy concerns and the lack of accessible, labeled datasets. In this paper, we present DialogAgent, an automated tool for generating synthetic training data that closely mimics real developer interactions within Integrated Development Environments (IDEs). DialogAgent enables the production of diverse, high-fidelity query-response pairs by simulating multi-turn dialogues and contextual behaviors observed in real-world programming scenarios. The tool significantly reduces the reliance on manual data generation, increasing efficiency by 4.8 times compared to traditional methods. Our experiments and online deployment demonstrate…
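
To make the idea of multi-turn query-response pairs concrete, here is a minimal sketch of how one simulated IDE dialogue could be stored and flattened into training records. The schema is an assumption made for illustration: the names `Turn`, `Dialogue`, `ide_context`, and `to_training_pairs` are hypothetical and are not taken from the paper or from DialogAgent itself.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One developer query and the response it received."""
    query: str
    response: str


@dataclass
class Dialogue:
    """A simulated multi-turn IDE session (hypothetical schema)."""
    ide_context: str  # e.g. the code the developer has open
    turns: list[Turn] = field(default_factory=list)


def to_training_pairs(dialogue: Dialogue) -> list[dict]:
    """Flatten a dialogue into one record per turn.

    Each record carries the earlier turns as history, so the record for
    turn N includes turns 1..N-1 but nothing that came later.
    """
    pairs: list[dict] = []
    history: list[dict] = []
    for turn in dialogue.turns:
        pairs.append({
            "context": dialogue.ide_context,
            "history": list(history),  # copy, so later turns do not leak in
            "query": turn.query,
            "response": turn.response,
        })
        history.append({"query": turn.query, "response": turn.response})
    return pairs


# Illustrative data only; not drawn from the paper's dataset.
demo = Dialogue(
    ide_context="def add(a, b):\n    return a - b",
    turns=[
        Turn("Why does add(2, 3) return -1?",
             "The body subtracts instead of adding."),
        Turn("How do I fix it?", "Change `a - b` to `a + b`."),
    ],
)
pairs = to_training_pairs(demo)
```

Keeping the history as an explicit per-record copy is one possible design choice: it lets each record be used on its own for fine-tuning without re-reading the full dialogue.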

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling