StreamLink: Large-Language-Model Driven Distributed Data Engineering System
Dawei Feng, Di Mei, Huiri Tan, Lei Ren, Xianying Lou, Zhangxi Tan

TL;DR
StreamLink leverages domain-adapted large language models integrated with distributed data frameworks to enhance natural language query processing, improve data accessibility, and ensure privacy and security in large-scale data engineering tasks.
Contribution
It introduces a novel LLM-driven distributed data system that combines privacy-preserving local fine-tuned models with secure query generation for scalable data engineering.
Findings
SQL generation accuracy improves by more than 10% over baselines
Enables rapid retrieval of top items from among hundreds of millions of items
Integrates LLMs with Spark and Hadoop for scalable data processing
Abstract
Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink - an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle large data at scale. One of the important design philosophies of StreamLink is to respect user data privacy by utilizing local fine-tuned LLMs instead of a public AI service like ChatGPT. With help from domain-adapted LLMs, we can improve our system's understanding of natural language queries from users in various scenarios and simplify the procedure of generating database queries like the Structured Query Language (SQL) for information processing. We also incorporate LLM-based syntax and security checkers…
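The flow the abstract describes (a natural language question, a generated SQL query, then syntax and security checks before execution) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not StreamLink's implementation: `generate_sql` is a hypothetical stand-in that returns a canned query where a locally hosted fine-tuned model would be called, the checkers are simple rule-based substitutes for the LLM-based checkers the paper mentions, and an in-memory SQLite database replaces the Spark and Hadoop backend. All function and table names are invented for the example.

```python
import re
import sqlite3

# Keywords that change data or schema; rejected by the security check below.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b",
    re.IGNORECASE,
)


def generate_sql(question: str) -> str:
    """Hypothetical stand-in for a locally hosted, fine-tuned LLM.

    A real system would send `question` to the model; here a canned
    query is returned so the sketch runs without any model.
    """
    return "SELECT name, sales FROM items ORDER BY sales DESC LIMIT 3"


def is_safe(sql: str) -> bool:
    """Rule-based security check (assumption: read-only queries only)."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # more than one statement
        return False
    if not stmt.lower().startswith("select"):
        return False
    return FORBIDDEN.search(stmt) is None


def is_valid(conn: sqlite3.Connection, sql: str) -> bool:
    """Syntax check: ask SQLite to plan the query without running it."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False


def answer(conn: sqlite3.Connection, question: str) -> list:
    """Generate a query, run both checks, and execute only if both pass."""
    sql = generate_sql(question)
    if not is_safe(sql):
        raise ValueError("query rejected by security check")
    if not is_valid(conn, sql):
        raise ValueError("query rejected by syntax check")
    return conn.execute(sql).fetchall()


# Tiny in-memory table standing in for a large distributed dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a", 5), ("b", 9), ("c", 1), ("d", 7)],
)
top_items = answer(conn, "Which three items sold the most?")
print(top_items)
```

With the sample rows above, the script prints the three best-selling items in descending order of sales: `[('b', 9), ('d', 7), ('a', 5)]`. Running the checks before execution means a malformed or destructive query never reaches the data store.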
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Database Systems and Queries · Scientific Computing and Data Management
