AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis
Teng Lin, Yuyu Luo, Nan Tang

TL;DR
AnnoRetrieve is a retrieval system built on structured annotations. It replaces embedding-based vector search with schema-driven queries, improving precision and reducing cost in unstructured document analysis.
Contribution
The paper presents two techniques: SchemaBoot, for automatic schema generation, and SSR, for annotation-driven retrieval. Together they eliminate manual schema design and reduce reliance on LLMs.
Findings
Reduces LLM calls and retrieval costs significantly.
Maintains high accuracy across multiple datasets.
Enables precise semantic matching without heavy LLM dependency.
Abstract
Unstructured documents dominate enterprise and web data, but their lack of explicit organization hinders precise information retrieval. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, incurring high computational cost and frequent LLM calls for post-processing. To address this critical issue, we propose AnnoRetrieve, a novel retrieval paradigm that shifts from embeddings to structured annotations, enabling precise, annotation-driven semantic retrieval. Our system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, dramatically reducing LLM usage and overall cost. The system integrates two synergistic core innovations: SchemaBoot, which automatically generates document annotation schemas via multi-granularity pattern discovery and constraint-based…
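To make the contrast with embedding-based search concrete, here is a minimal sketch of retrieval as a structured query over annotations. It is not the authors' implementation: the schema (`company`, `event`, `year`), the sample records, and the helper names `build_index` and `retrieve` are all hypothetical, and the annotations are hand-written here rather than induced automatically.

```python
import sqlite3

# Hypothetical annotations. In the paper's setting these would be produced by
# annotating documents against an induced schema; here they are hand-written.
ANNOTATIONS = [
    {"doc_id": "d1", "company": "Acme", "event": "acquisition", "year": 2021},
    {"doc_id": "d2", "company": "Acme", "event": "layoff", "year": 2023},
    {"doc_id": "d3", "company": "Globex", "event": "acquisition", "year": 2023},
]


def build_index(annotations):
    """Load annotation records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotations "
        "(doc_id TEXT, company TEXT, event TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (:doc_id, :company, :event, :year)",
        annotations,
    )
    return conn


def retrieve(conn, event, min_year):
    """Return doc ids whose annotations match the structured predicate."""
    rows = conn.execute(
        "SELECT doc_id FROM annotations "
        "WHERE event = ? AND year >= ? ORDER BY doc_id",
        (event, min_year),
    )
    return [row[0] for row in rows]


conn = build_index(ANNOTATIONS)
print(retrieve(conn, "acquisition", 2022))  # ['d3']
```

The query is an exact predicate over typed fields, so no vector comparison or LLM call is needed at lookup time. That is the kind of saving the abstract describes, although the actual system may structure its queries quite differently.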
