S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis

Prashant Verma

arXiv:2501.05485 · cs.CL · January 13, 2025

TL;DR

This paper presents a hybrid document segmentation framework that combines layout structure, semantic analysis, and spatial relationships to improve chunking accuracy in complex documents. The paper reports that it outperforms traditional semantic-only methods.

Contribution

It introduces a hybrid approach that builds a weighted graph over document elements from bounding box (layout) information and text embeddings, then partitions that graph with spectral clustering to produce more cohesive chunks.

Findings

01. Outperforms traditional semantic-only methods in diverse layouts

02. Ensures chunks do not exceed specified token lengths

03. Effective in complex, multi-column documents
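The summary here does not say how the token limit in finding 02 is enforced. As a minimal sketch of one way such a constraint could work, the hypothetical helper below (`enforce_token_limit`, not taken from the paper or its repository) greedily splits any clustered chunk whose elements would exceed a token budget. The whitespace token counter is an assumption that stands in for a real tokenizer.

```python
def enforce_token_limit(chunks, max_tokens, count_tokens=None):
    """Split chunks so that each sub-chunk stays within max_tokens.

    chunks: list of chunks, each a list of element texts in reading order.
    count_tokens: hypothetical token counter; defaults to whitespace splitting,
    which is an assumption and not the paper's tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    result = []
    for chunk in chunks:
        current, used = [], 0
        for text in chunk:
            n = count_tokens(text)
            # Start a new sub-chunk when adding this element would overflow.
            if current and used + n > max_tokens:
                result.append(current)
                current, used = [], 0
            current.append(text)
            used += n
        if current:
            result.append(current)
    # Caveat: a single element longer than max_tokens is kept whole here;
    # a stricter version would also split inside the element.
    return result


clusters = [["a b c", "d e", "f g h i"], ["j k"]]
limited = enforce_token_limit(clusters, max_tokens=5)
```

With a budget of five tokens, the first cluster is split after its second element and the second cluster passes through unchanged.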

Abstract

Document chunking is a critical task in natural language processing (NLP) that involves dividing a document into meaningful segments. Traditional methods often rely solely on semantic analysis, ignoring the spatial layout of elements, which is crucial for understanding relationships in complex documents. This paper introduces a novel hybrid approach that combines layout structure, semantic analysis, and spatial relationships to enhance the cohesion and accuracy of document chunks. By leveraging bounding box information (bbox) and text embeddings, our method constructs a weighted graph representation of document elements, which is then clustered using spectral clustering. Experimental results demonstrate that this approach outperforms traditional methods, particularly in documents with diverse layouts such as reports, articles, and multi-column designs. The proposed method also ensures…
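The pipeline in the abstract (bounding boxes and text embeddings, then a weighted graph, then spectral clustering) can be illustrated with a small sketch. Everything specific below is an assumption rather than the paper's formulation: the centroid-distance spatial weight, the cosine semantic weight, the mixing factor `alpha`, and the function names. For brevity it performs only a two-way spectral split using the second eigenvector of the normalized Laplacian, whereas a general spectral clustering would use k eigenvectors followed by k-means.

```python
import numpy as np


def spatial_weights(bboxes):
    """Affinity from bounding-box centroid distance (assumed form)."""
    b = np.asarray(bboxes, dtype=float)  # rows are (x0, y0, x1, y1)
    centers = np.stack([(b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    scale = max(dist.max(), 1e-12)  # normalize so weights are page-size independent
    return 1.0 / (1.0 + dist / scale)


def semantic_weights(embeddings):
    """Affinity from cosine similarity of text embeddings, clipped to [0, 1]."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return np.clip(e @ e.T, 0.0, 1.0)


def combined_graph(bboxes, embeddings, alpha=0.5):
    """Weighted adjacency matrix; alpha is a hypothetical mixing factor."""
    w = alpha * spatial_weights(bboxes) + (1.0 - alpha) * semantic_weights(embeddings)
    np.fill_diagonal(w, 0.0)  # no self-loops
    return w


def spectral_bipartition(w):
    """Two-way spectral split using the normalized graph Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]      # second-smallest eigenvector
    return (fiedler > 0).astype(int)


# Toy page: two elements near the top left, two near the bottom right.
bboxes = [(0, 0, 100, 20), (0, 25, 100, 45), (300, 400, 400, 420), (300, 425, 400, 445)]
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]  # made-up 2-d vectors
labels = spectral_bipartition(combined_graph(bboxes, embeddings))
```

On this toy input the first two elements receive one label and the last two receive the other. Which group is labeled 0 or 1 depends on the arbitrary sign of the eigenvector.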

Code & Models

Repositories

Vprashant/s2-chunking-lib (Official)

Taxonomy

Topics: Geographic Information Systems Studies