S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis
Prashant Verma

TL;DR
This paper presents a hybrid document segmentation framework that integrates layout structure, semantic analysis, and spatial relationships to improve chunking accuracy in complex documents, outperforming traditional semantic-only methods.
Contribution
The paper introduces a hybrid approach that combines layout, semantic, and spatial information in a weighted graph of document elements and partitions that graph with spectral clustering for more accurate document segmentation.
Findings
Outperforms traditional semantic-only methods in diverse layouts
Ensures chunks do not exceed specified token lengths
Effective in complex, multi-column documents
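The token-length finding can be illustrated with a minimal sketch. The summary does not say how the limit is enforced, so everything here is an assumption: the function name `split_by_token_limit` is hypothetical, tokens are approximated by whitespace-separated words, and an oversized cluster is simply re-split greedily in reading order.

```python
def split_by_token_limit(elements, max_tokens):
    """Re-split one cluster of text elements so no chunk exceeds max_tokens.

    Assumption: a "token" is a whitespace-separated word. A real pipeline
    would likely count tokens with the tokenizer of the downstream model.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    # First cut any single element that is itself longer than the limit.
    pieces = []
    for text in elements:
        words = text.split()
        for start in range(0, len(words), max_tokens):
            pieces.append(words[start:start + max_tokens])

    # Then pack the pieces greedily, in order, without crossing the limit.
    chunks, current = [], []
    for piece in pieces:
        if current and len(current) + len(piece) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


cluster = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota kappa lambda"]
chunks = split_by_token_limit(cluster, max_tokens=4)
```

Greedy packing keeps elements in their original order, which preserves reading order inside a cluster at the cost of sometimes producing short trailing chunks.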
Abstract
Document chunking is a critical task in natural language processing (NLP) that involves dividing a document into meaningful segments. Traditional methods often rely solely on semantic analysis, ignoring the spatial layout of elements, which is crucial for understanding relationships in complex documents. This paper introduces a novel hybrid approach that combines layout structure, semantic analysis, and spatial relationships to enhance the cohesion and accuracy of document chunks. By leveraging bounding box information (bbox) and text embeddings, our method constructs a weighted graph representation of document elements, which is then clustered using spectral clustering. Experimental results demonstrate that this approach outperforms traditional methods, particularly in documents with diverse layouts such as reports, articles, and multi-column designs. The proposed method also ensures…
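As a rough illustration of the graph construction described in the abstract, the sketch below builds a symmetric weight matrix from bounding boxes and text embeddings. The abstract does not give the weighting formula, so the details are assumptions made for illustration: the exponential distance kernel, the `scale` parameter, the mixing weight `alpha`, and all function names are hypothetical.

```python
import math


def bbox_center(bbox):
    """Center (x, y) of a bounding box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def spatial_similarity(bbox_a, bbox_b, scale):
    """Assumed kernel: exp(-distance / scale), so nearby boxes score near 1."""
    (ax, ay), (bx, by) = bbox_center(bbox_a), bbox_center(bbox_b)
    return math.exp(-math.hypot(ax - bx, ay - by) / scale)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_affinity(bboxes, embeddings, alpha=0.5, scale=100.0):
    """Weighted graph as a dense symmetric matrix with a zero diagonal.

    Each edge weight mixes spatial and semantic similarity; alpha is an
    assumed mixing weight, not a value taken from the paper.
    """
    n = len(bboxes)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            spatial = spatial_similarity(bboxes[i], bboxes[j], scale)
            # Clip negative cosine values so all edge weights stay non-negative.
            semantic = max(0.0, cosine_similarity(embeddings[i], embeddings[j]))
            weights[i][j] = weights[j][i] = alpha * spatial + (1 - alpha) * semantic
    return weights


# Two stacked paragraphs on the left and one unrelated element far to the right.
bboxes = [(0, 0, 100, 20), (0, 25, 100, 45), (300, 0, 400, 20)]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
W = build_affinity(bboxes, embeddings)
```

A matrix like `W` could then be handed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering(affinity="precomputed")`; whether the paper uses that implementation is not stated here.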
Taxonomy
Topics: Geographic Information Systems Studies
