Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation
Mile Stankovic

TL;DR
This paper introduces Cross-Document Topic-Aligned chunking, a method that reconstructs knowledge across multiple documents to improve retrieval-augmented generation, especially for complex multi-source queries, by creating unified, information-dense chunks.
Contribution
It presents a novel cross-document topic-aligned chunking approach that enhances knowledge reconstruction and retrieval efficiency in RAG systems, outperforming existing methods in faithfulness and citation accuracy.
Findings
Achieved 0.93 faithfulness on HotpotQA versus 0.83 for contextual retrieval, a 12% relative improvement over current industry best practice (p < 0.05).
Reached 0.94 faithfulness on UAE Legal texts with 0.93 citation accuracy.
Maintains 0.91 faithfulness at k = 3, where semantic methods drop to 0.68, reducing query-time retrieval needs.
Abstract
Chunking quality determines RAG system performance. Current methods partition documents individually, but complex queries need information scattered across multiple sources: the knowledge fragmentation problem. We introduce Cross-Document Topic-Aligned (CDTA) chunking, which reconstructs knowledge at the corpus level. It first identifies topics across documents, then maps segments to each topic, and finally synthesizes them into unified chunks. On HotpotQA multi-hop reasoning, our method reached 0.93 faithfulness versus 0.83 for contextual retrieval and 0.78 for semantic chunking, a 12% improvement over current industry best practice (p < 0.05). On UAE Legal texts, it reached 0.94 faithfulness with 0.93 citation accuracy. At k = 3, it maintains 0.91 faithfulness while semantic methods drop to 0.68, with a single CDTA chunk containing information that would otherwise require multiple traditional fragments. Indexing…
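The three stages named in the abstract (identify topics across documents, map segments to each topic, synthesize unified chunks) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, topic discovery is approximated by terms shared across at least two documents, and synthesis is approximated by concatenating segments with source tags. The abstract does not specify how the paper implements any of these stages, so this shows only the shape of the pipeline, not the author's method.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"the", "and", "for", "are", "with", "that", "this", "from"}


def tokenize(text):
    """Lowercase content words of length > 2, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def split_segments(docs):
    """Split each document into sentence segments tagged with its source id."""
    segments = []
    for doc_id, text in docs.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                segments.append((doc_id, sentence))
    return segments


def identify_topics(segments, min_docs=2):
    """Stand-in topic discovery: terms that occur in at least min_docs documents."""
    docs_per_term = defaultdict(set)
    for doc_id, sentence in segments:
        for term in set(tokenize(sentence)):
            docs_per_term[term].add(doc_id)
    return sorted(t for t, ids in docs_per_term.items() if len(ids) >= min_docs)


def map_segments(segments, topics):
    """Attach every segment that mentions a topic term to that topic."""
    mapping = {topic: [] for topic in topics}
    for doc_id, sentence in segments:
        terms = set(tokenize(sentence))
        for topic in topics:
            if topic in terms:
                mapping[topic].append((doc_id, sentence))
    return mapping


def synthesize(mapping):
    """Stand-in synthesis: join a topic's segments, keeping a source tag on each."""
    return {
        topic: " ".join(f"{sentence} [{doc_id}]" for doc_id, sentence in segs)
        for topic, segs in mapping.items()
    }


def cdta_sketch(docs, min_docs=2):
    """Run the three stages end to end over a {doc_id: text} corpus."""
    segments = split_segments(docs)
    topics = identify_topics(segments, min_docs)
    return synthesize(map_segments(segments, topics))


docs = {
    "a": "Contracts require consent. Courts enforce penalties.",
    "b": "Consent must be informed. Taxes fund roads.",
}
chunks = cdta_sketch(docs)
```

In this toy corpus the only term shared by both documents is "consent", so the result is one chunk that gathers the relevant sentence from each source and keeps its citation tag. This mirrors the idea that a single cross-document chunk can stand in for several per-document fragments.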
Taxonomy
Topics: Information Retrieval and Search Behavior · Semantic Web and Ontologies · Natural Language Processing Techniques
