SPIRE: Structure-Preserving Interpretable Retrieval of Evidence
Mike Rainey, Umut Acar, Muhammed Sezer

TL;DR
This paper introduces SPIRE, a structure-preserving retrieval system for semi-structured documents like HTML, enhancing evidence retrieval by maintaining document structure and providing more interpretable, citation-ready results.
Contribution
SPIRE presents a novel, structure-aware retrieval pipeline that operates over tree-structured documents, improving evidence quality and interpretability over traditional linearized methods.
Findings
Structure preservation yields higher-quality, more diverse citations.
Outperforms passage-based baselines on HTML question-answering benchmarks.
Maintains scalability while enhancing interpretability.
Abstract
Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives: paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms.…
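To make the primitives named in the abstract more concrete, here is a minimal Python sketch of paths, path sets, and subdocument extraction by pruning. It is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the encoding of a path as a tuple of child indices, and the rule that pruning keeps selected nodes in full and keeps their ancestors as a tag-only skeleton are all hypothetical choices made for this example.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Assumption: a path is the sequence of child indices from the root; () is the root.
Path = Tuple[int, ...]


@dataclass
class Node:
    """Hypothetical tree node standing in for an HTML element."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def prune(root: Node, paths: Set[Path]) -> Optional[Node]:
    """Extract a subdocument: keep each selected node with its whole subtree,
    keep ancestors as a tag-only skeleton, and drop everything else.
    Returns None when the path set selects nothing. The input tree is not modified."""
    def walk(node: Node, here: Path) -> Optional[Node]:
        if here in paths:
            return copy.deepcopy(node)  # selected: keep the full subtree
        kept = []
        for i, child in enumerate(node.children):
            sub = walk(child, here + (i,))
            if sub is not None:
                kept.append(sub)
        if not kept:
            return None  # nothing selected below this node
        return Node(node.tag, "", kept)  # ancestor: keep structure, drop text
    return walk(root, ())


def render(node: Node, depth: int = 0) -> List[str]:
    """Render a tree as indented outline lines, one per node."""
    label = node.tag if not node.text else f"{node.tag}: {node.text}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# A toy document with a heading, a list, and a table.
doc = Node("html", children=[
    Node("h1", "Pricing"),
    Node("ul", children=[Node("li", "Basic: $5"), Node("li", "Pro: $20")]),
    Node("table", children=[Node("tr", "Plan | Price")]),
])

# A path set with one path: the second list item.
evidence = prune(doc, {(1, 1)})
print("\n".join(render(evidence)))
```

In this sketch the extracted subdocument still records that the evidence came from a list item inside the list, which is the kind of structural identity a flat text chunk would lose. How much surrounding context to attach is left open here, in line with the abstract's point about deferring that choice.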
