Towards Accurate and Efficient Document Analytics with Large Language Models

Yiming Lin; Madelon Hulsebos; Ruiying Ma; Shreya Shankar; Sepanta Zeighami; Aditya G. Parameswaran; Eugene Wu

arXiv:2405.04674·cs.DB·May 9, 2024·2 cites

Open Access

TL;DR

ZenDB is a system that uses semantic structures in templatized unstructured documents combined with LLMs to enable accurate, efficient, and cost-effective ad-hoc SQL queries on large document collections.

Contribution

The paper introduces ZenDB, a novel system that leverages semantic hierarchical structures in templatized documents to improve query accuracy and reduce costs compared to existing LLM and RAG-based methods.

Findings

01. Achieves up to 30% cost savings over LLM baselines.

02. Surpasses RAG baselines by up to 61% in precision and 80% in recall.

03. Maintains or improves accuracy while reducing costs.

Abstract

Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical…
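The idea of exploiting a shared semantic hierarchy can be illustrated with a minimal sketch. This is not ZenDB's implementation: the `Section` class, the helper functions, and the sample headings are all hypothetical, and the keyword match stands in for whatever matching the real system performs. The sketch assumes headings have already been detected with a nesting level, builds a section tree from them, and then collects only the text under the matching section, so that a smaller context than the whole document would be passed to an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node in a hypothetical semantic hierarchy of a document."""
    level: int
    title: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)


def build_tree(headings):
    """Build a section tree from (level, title, text) tuples in document order.

    Levels are assumed to start at 1; the synthetic root sits at level 0.
    """
    root = Section(level=0, title="ROOT")
    stack = [root]
    for level, title, text in headings:
        node = Section(level, title, text)
        # Pop until the top of the stack is a strictly shallower section.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def find_sections(node, keyword):
    """Return every section whose title contains the keyword (case-insensitive)."""
    hits = [node] if keyword.lower() in node.title.lower() else []
    for child in node.children:
        hits.extend(find_sections(child, keyword))
    return hits


def collect_text(node):
    """Concatenate the text of a section and all of its descendants."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.append(collect_text(child))
    return " ".join(p for p in parts if p)


# Hypothetical headings from one templatized document (assumed input).
headings = [
    (1, "Introduction", "Background text."),
    (1, "Evaluation", "Overview."),
    (2, "Datasets", "Three collections."),
    (2, "Cost", "Token usage."),
    (1, "Conclusion", "Summary."),
]

tree = build_tree(headings)
hits = find_sections(tree, "evaluation")
context = collect_text(hits[0])  # only this subtree would go to the LLM
print(context)
```

Restricting the context to one subtree is the kind of saving the abstract attributes to using document structure; how ZenDB actually extracts the hierarchy and maps SQL predicates onto it is described in the paper itself.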

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies