A Method for Parsing and Vectorization of Semi-structured Data used in Retrieval Augmented Generation
Hang Yang, Jing Guo, Jianchuan Qi, Jinliang Xie, Si Zhang, Siqi Yang, Nan Li, Ming Xu

TL;DR
This paper introduces a new method for parsing and vectorizing semi-structured data to improve Retrieval-Augmented Generation in large language models, especially for environmental science applications.
Contribution
The paper presents a comprehensive pipeline that converts diverse data formats into .docx, parses them into structured data, and stores the resulting vectors in a Pinecone vector database, enhancing LLM performance in specialized domains.
Findings
Improved accuracy and reliability of LLM outputs in environmental management.
Effective parsing and vectorization of multi-format semi-structured data.
Enhanced context-specific response generation in LLMs.
Abstract
This paper presents a novel method for parsing and vectorizing semi-structured data to enhance the functionality of Retrieval-Augmented Generation (RAG) within Large Language Models (LLMs). We developed a comprehensive pipeline for converting various data formats into .docx, enabling efficient parsing and structured data extraction. The core of our methodology involves the construction of a vector database using Pinecone, which integrates seamlessly with LLMs to provide accurate, context-specific responses, particularly in environmental management and wastewater treatment operations. Through rigorous testing with both English and Chinese texts in diverse document formats, our results demonstrate a marked improvement in the precision and reliability of LLM outputs. The RAG-enhanced models displayed an enhanced ability to generate contextually rich and technically accurate responses,…
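The parse, vectorize, and retrieve flow described in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. The heading convention (`# `), the bag-of-words `embed` function, the in-memory list standing in for a Pinecone index, and the sample text are all assumptions made for illustration. A real pipeline would parse .docx files, call an embedding model, and upsert the vectors into a vector database.

```python
import math
import re
from collections import Counter


def chunk_by_heading(lines):
    """Group already-parsed lines into (heading, text) chunks.

    Assumption: lines starting with '# ' mark section headings.
    """
    chunks = []
    heading, body = "Untitled", []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("# "):
            if body:
                chunks.append((heading, " ".join(body)))
            heading, body = line[2:], []
        else:
            body.append(line)
    if body:
        chunks.append((heading, " ".join(body)))
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """In-memory list of records; a stand-in for a vector database index."""
    return [
        {"id": i, "heading": h, "text": t, "vector": embed(h + " " + t)}
        for i, (h, t) in enumerate(chunks)
    ]


def retrieve(index, query, top_k=1):
    """Return the top_k records most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:top_k]


# Illustrative sample text, not taken from the paper.
doc_lines = [
    "# Aeration",
    "Dissolved oxygen in the aeration tank is kept near a setpoint.",
    "# Sludge handling",
    "Excess sludge is thickened and dewatered before disposal.",
]
index = build_index(chunk_by_heading(doc_lines))
best = retrieve(index, "how is sludge dewatered")[0]
print(best["heading"])
```

In a RAG setting, the text of the retrieved chunks would be placed into the LLM prompt as context before the user's question. Keeping the heading with each chunk preserves some of the document's structure in the retrieved passage.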
Taxonomy
Topics: Neural Networks and Applications · Advanced Data Compression Techniques
