Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model
Yu-Chen Lin, Akhilesh Kumar, Norman Chang, Wenliang Zhang, Muhammad Zakir, Rucha Apte, Haiyang He, Chao Wang, Jyh-Shing Roger Jang

TL;DR
This paper introduces innovative data preprocessing and verification techniques to improve large language models' ability to generate accurate, domain-specific engineering code, demonstrated through a case study with RedHawk-SC software.
Contribution
The paper presents new LLM-based data splitting, renovation, and verification methods, along with a prompt technique, to enhance domain-specific code generation performance.
Findings
Achieved 73.33% correct lines in MapReduce code generation
Enhanced retrieval relevance with IKEC and RAG techniques
Demonstrated effectiveness in engineering simulation software context
Abstract
We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique; and (iv) effectively refactoring existing scripts to generate new, high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques enhance the Retrieval-Augmented Generation…
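The retrieval step that the abstract refers to can be illustrated with a minimal sketch. It is not the authors' pipeline: the bag-of-words "embedding", the chunk descriptions, the placeholder snippets, and the function names (`embed`, `cosine`, `retrieve`) are all assumptions made for illustration. The sketch only shows the general idea that script chunks paired with a natural-language description can be ranked against a query and the best match placed in the prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, embed(c["description"])),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical script chunks; descriptions and code are placeholders,
# not real RedHawk-SC scripts.
chunks = [
    {"description": "load the design database and list all cells",
     "code": "# hypothetical snippet A"},
    {"description": "run a mapreduce job to compute voltage drop per instance",
     "code": "# hypothetical snippet B"},
]

task = "compute voltage drop with mapreduce"
top = retrieve(task, chunks, k=1)

# The retrieved script is placed in the prompt as reference material.
prompt = "Reference script:\n" + top[0]["code"] + "\n\nTask: " + task
print(prompt)
```

In a real system the toy `embed` function would be replaced by a learned embedding model. The quality of the text that gets embedded then determines what is retrieved, which is the part of the pipeline that the data splitting and renovation steps target.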
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Data Quality and Management
