TopoChunker: Topology-Aware Agentic Document Chunking Framework
Xiaoyu Liu

TL;DR
TopoChunker is a novel framework that preserves document topology during chunking, improving retrieval quality and efficiency in RAG systems by explicitly modeling hierarchical dependencies.
Contribution
It introduces a dual-agent architecture for topology-aware document chunking, balancing structural fidelity with computational cost, and achieves state-of-the-art results on complex datasets.
Findings
Outperforms baseline by 8.0% in generation accuracy
Achieves 83.26% Recall@3
Reduces token overhead by 23.5%
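For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute Recall@k for a single query. The source does not say which variant of Recall@3 it reports, so the function name, the chunk IDs, and the "fraction of relevant chunks found in the top k" definition are all assumptions made for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs.

    Assumed definition; some evaluations instead count a query as a hit
    if at least one relevant chunk appears in the top k.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


# Hypothetical chunk IDs for a single query.
ranked = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4"]
score = recall_at_k(ranked, relevant, k=3)  # only "c2" is in the top 3
```

A dataset-level figure would typically be the mean of this score over all queries.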
Abstract
Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art…
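To make the idea of preserving hierarchical lineage concrete, here is a minimal sketch, not the paper's implementation. It assumes a document has already been parsed into a tree of sections (a hypothetical stand-in for the SIR), then emits chunks that each carry their heading path, splitting any section whose text exceeds a word budget (a simplified stand-in for capacity auditing). The `Node` class, `chunk_with_lineage` function, and `max_words` parameter are illustrative names that do not come from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a document tree (hypothetical stand-in for an SIR node)."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def chunk_with_lineage(node, max_words=50, lineage=()):
    """Yield (heading_path, chunk_text) pairs in depth-first order."""
    path = lineage + (node.title,)
    words = node.text.split()
    # Simplified capacity check: split text that exceeds the word budget.
    # Sections with no body text of their own produce no chunk.
    for start in range(0, len(words), max_words):
        yield " > ".join(path), " ".join(words[start:start + max_words])
    # Children inherit the full heading path of their ancestors.
    for child in node.children:
        yield from chunk_with_lineage(child, max_words, path)


# Toy document with made-up content.
doc = Node("Report", "Overview text.", [
    Node("Budget", "", [Node("FY2023", "Spending rose in three areas.")]),
    Node("Findings", "one two three four five six"),
])

chunks = list(chunk_with_lineage(doc, max_words=4))
```

Each chunk keeps its ancestry (for example `Report > Budget > FY2023`), so a retriever sees where a passage sits in the document rather than an isolated span of linearized text.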
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Information Retrieval and Search Behavior
