DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production
Xiaoyun Liang, Jingyi Ren, Jiayi Qi, Chao Peng, Bo Jiang

TL;DR
DialogAgent is an automated tool that generates realistic synthetic training data for code question-answering, significantly improving model performance and reducing manual data creation efforts.
Contribution
We introduce DialogAgent, a novel system for producing high-quality synthetic developer interaction data to enhance code-related language models.
Findings
Increased data generation efficiency by 4.8 times compared to traditional methods
33% improvement in response acceptance rate
Enhanced model performance on code QA tasks
Abstract
Large Language Models (LLMs) have become increasingly integral to enhancing developer productivity, particularly in code generation, comprehension, and repair tasks. However, fine-tuning these models with high-quality, real-world data is challenging due to privacy concerns and the lack of accessible, labeled datasets. In this paper, we present DialogAgent, an automated tool for generating synthetic training data that closely mimics real developer interactions within Integrated Development Environments (IDEs). DialogAgent enables the production of diverse, high-fidelity query-response pairs by simulating multi-turn dialogues and contextual behaviors observed in real-world programming scenarios. The tool significantly reduces the reliance on manual data generation, increasing efficiency by 4.8 times compared to traditional methods. Our experiments and online deployment demonstrate…
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
