AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis

Teng Lin; Yuyu Luo; Nan Tang

arXiv:2604.02690·cs.IR·April 6, 2026

TL;DR

AnnoRetrieve introduces a structured annotation-based retrieval system that enhances precision and reduces costs in unstructured document analysis by replacing traditional embedding methods with schema-driven queries.

Contribution

The paper presents SchemaBoot and SSR, novel techniques for automatic schema generation and annotation-driven retrieval, eliminating manual schema design and reducing reliance on LLMs.

Findings

01. Reduces LLM calls and retrieval costs significantly.

02. Maintains high accuracy across multiple datasets.

03. Enables precise semantic matching without heavy LLM dependency.

Abstract

Unstructured documents dominate enterprise and web data, but their lack of explicit organization hinders precise information retrieval. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, incurring high computational cost and frequent LLM calls for post-processing. To address this critical issue, we propose AnnoRetrieve, a novel retrieval paradigm that shifts from embeddings to structured annotations, enabling precise, annotation-driven semantic retrieval. Our system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, dramatically reducing LLM usage and overall cost. The system integrates two synergistic core innovations: SchemaBoot, which automatically generates document annotation schemas via multi-granularity pattern discovery and constraint-based…
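To make the contrast with vector search concrete, the following is a minimal sketch of retrieval expressed as a structured query over per-document annotations. It is not the authors' implementation: the schema fields (`doc_type`, `party`, `year`), the sample records, and the helper names `build_index` and `retrieve` are all hypothetical, and the annotations are hard-coded here rather than produced by an automatic schema or annotation step such as the one the abstract describes.

```python
import sqlite3

# Hypothetical annotations under an assumed schema. In a real system these
# would be produced by an annotation step; here they are hard-coded.
ANNOTATIONS = [
    {"doc_id": "d1", "doc_type": "contract", "party": "Acme", "year": 2021},
    {"doc_id": "d2", "doc_type": "contract", "party": "Globex", "year": 2023},
    {"doc_id": "d3", "doc_type": "invoice", "party": "Acme", "year": 2023},
]


def build_index(annotations):
    """Load annotation records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotations ("
        "doc_id TEXT PRIMARY KEY, doc_type TEXT, party TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (:doc_id, :doc_type, :party, :year)",
        annotations,
    )
    return conn


def retrieve(conn, doc_type, min_year):
    """Return ids of documents matching exact field constraints.

    No embeddings or similarity scores are involved: the query is a plain
    filter over the annotated fields.
    """
    rows = conn.execute(
        "SELECT doc_id FROM annotations "
        "WHERE doc_type = ? AND year >= ? ORDER BY doc_id",
        (doc_type, min_year),
    )
    return [row[0] for row in rows]


conn = build_index(ANNOTATIONS)
print(retrieve(conn, "contract", 2022))
```

In this toy setup a request such as "contracts from 2022 onward" becomes an exact filter on two annotated fields, so no similarity ranking or LLM post-filtering is needed for that query. Whether the same holds at scale depends on annotation quality, which this sketch does not model.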
