Unstructured Data Analysis using LLMs: A Comprehensive Benchmark
Qiyan Deng, Jianhui Li, Chengliang Chai, Jinqi Liu, Junzhi She, Kaisen Jin, Zhaoze Sun, Yuhao Deng, Jia Yuan, Ye Yuan, Guoren Wang, Lei Cao

TL;DR
This paper introduces UDA-Bench, a comprehensive benchmark with diverse datasets and queries, to evaluate large language model-based unstructured data analysis systems thoroughly.
Contribution
It provides the first large-scale, high-quality benchmark for evaluating UDA systems, including curated datasets, diverse queries, and detailed analysis of system components.
Findings
UDA systems differ significantly in their query interfaces, query optimization strategies, and operator implementations.
The benchmark is used to evaluate multiple UDA systems across several datasets and query types.
The evaluation provides insight into the strengths and weaknesses of existing approaches.
Abstract
The explosion of unstructured data presents immense analytical value. Leveraging the remarkable capability of large language models (LLMs) to extract the attributes of structured tables from unstructured data, researchers are developing LLM-powered data systems that let users analyze unstructured documents as if they were working with a database. These unstructured data analysis (UDA) systems differ significantly in all aspects, including query interfaces, query optimization strategies, and operator implementations, making it unclear which performs best in which scenario. Unfortunately, no comprehensive benchmark exists that offers high-quality, large-volume, and diverse datasets as well as a rich query workload to thoroughly evaluate such systems. To fill this gap, we present UDA-Bench, the first benchmark for unstructured data analysis that meets all the above requirements.…
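To make the idea of analyzing documents "as if working with a database" concrete, here is a minimal sketch in Python. It is not UDA-Bench's code or the interface of any system the paper evaluates. The documents, the `people` table, and the function names are all hypothetical, and a regular expression stands in for the LLM extraction call so that the example runs without a model.

```python
import re
import sqlite3

# Hypothetical documents for illustration; not taken from UDA-Bench.
DOCS = [
    "Alice Chen is a 34-year-old cardiologist based in Boston.",
    "Bob Ruiz, a 51-year-old radiologist, works in Denver.",
    "Carla Diaz is a 29-year-old cardiologist based in Austin.",
]


def extract_attributes(text):
    """Stand-in for an LLM extraction call (assumption: a real UDA system
    would prompt a model here). Returns (age, specialty) or None."""
    match = re.search(r"(\d+)-year-old (\w+)", text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def build_table(docs):
    """Materialize the extracted attributes as rows of an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (doc_id INTEGER, age INTEGER, specialty TEXT)"
    )
    for doc_id, text in enumerate(docs):
        attrs = extract_attributes(text)
        if attrs is not None:
            conn.execute(
                "INSERT INTO people VALUES (?, ?, ?)", (doc_id, *attrs)
            )
    return conn


def average_age_by_specialty(conn):
    """Run an ordinary SQL aggregation over the extracted attributes."""
    rows = conn.execute(
        "SELECT specialty, AVG(age) FROM people "
        "GROUP BY specialty ORDER BY specialty"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(average_age_by_specialty(build_table(DOCS)))
```

The sketch extracts every attribute from every document before querying. This is the simplest possible strategy; the differences in query optimization and operator implementation that the abstract mentions concern design choices of this kind.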
Taxonomy
Topics: Data Quality and Management · Natural Language Processing Techniques · Computational and Text Analysis Methods
