Towards Accurate and Efficient Document Analytics with Large Language Models
Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu

TL;DR
ZenDB is a system that combines LLMs with the semantic structure of templatized unstructured documents to enable accurate, efficient, and cost-effective ad-hoc SQL queries on large document collections.
Contribution
The paper introduces ZenDB, a novel system that leverages semantic hierarchical structures in templatized documents to improve query accuracy and reduce costs compared to existing LLM and RAG-based methods.
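As a rough illustration of why a section hierarchy can cut LLM cost, the sketch below models a document as a tree of headings and passes only the matching subtree to a (not shown) LLM call. It is not ZenDB's algorithm: the `Section` class, `select_sections`, the keyword match, and the use of word counts as a stand-in for tokens are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of a hypothetical semantic hierarchy: a heading plus its body text."""
    heading: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def word_count(node: Section) -> int:
    """Crude stand-in for a token count: words in this node and its subtree."""
    return sum(len(n.text.split()) for n in node.walk())


def select_sections(root: Section, keyword: str) -> list[Section]:
    """Return the topmost sections whose heading mentions the keyword."""
    if keyword.lower() in root.heading.lower():
        return [root]
    hits = []
    for child in root.children:
        hits.extend(select_sections(child, keyword))
    return hits


# A toy document (invented for this example).
doc = Section("Report", "Title page text", [
    Section("Introduction", "This report covers the fiscal year"),
    Section("Budget", "Total cost is 100", [
        Section("Budget details", "Travel 40 Equipment 60"),
    ]),
])

hits = select_sections(doc, "budget")
print("words in whole document:", word_count(doc))
print("words sent for 'budget':", sum(word_count(s) for s in hits))
```

The point of the design is only that a query touching one attribute needs the text under the relevant heading, not the whole document; how headings are actually detected and matched is outside this sketch.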
Findings
Achieves up to 30% cost savings over LLM baselines.
Surpasses RAG baselines by up to 61% in precision and 80% in recall.
Maintains or improves accuracy while reducing costs.
Abstract
Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
