LumberChunker: Long-Form Narrative Document Segmentation
André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

TL;DR
LumberChunker is a novel document segmentation method using LLMs to improve retrieval by dynamically identifying content shifts in long narratives, demonstrated on a new benchmark with significant performance gains.
Contribution
The paper introduces LumberChunker, an LLM-based dynamic segmentation technique for long documents, and presents GutenQA, a new benchmark for evaluating retrieval in narrative texts.
Findings
LumberChunker outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20).
It is more effective than other chunking methods in RAG pipelines.
The approach demonstrates significant improvements on the GutenQA benchmark.
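The retrieval figure is reported in the abstract as DCG@20. A minimal sketch of how such a score could be computed follows. It assumes the standard discounted cumulative gain definition, with one relevance value per ranked result. The source does not spell out the formula it uses, and the function name dcg_at_k is hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 20) -> float:
    """Discounted cumulative gain over the top-k ranked results.

    relevances[i] is the relevance of the result at rank i + 1
    (assumption: 1.0 for the chunk containing the answer, 0.0 otherwise).
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


# A relevant chunk at rank 1 scores higher than the same chunk at rank 3.
top_hit = dcg_at_k([1.0, 0.0, 0.0])
late_hit = dcg_at_k([0.0, 0.0, 1.0])
```

Under this definition a hit that falls outside the top k contributes nothing, so a chunking method is rewarded only when it places the answer-bearing chunk near the top of the ranking.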
Abstract
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves…
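The iterative segmentation loop described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the LLM call is replaced by a caller-supplied function find_shift, the window size is measured in whitespace-separated words with an arbitrary max_words budget, and all names here are hypothetical.

```python
from typing import Callable, List, Sequence


def chunk_by_content_shift(
    passages: Sequence[str],
    find_shift: Callable[[Sequence[str]], int],
    max_words: int = 550,  # assumed window budget, not taken from the source
) -> List[List[str]]:
    """Split sequential passages into variable-size chunks.

    find_shift stands in for the LLM prompt: given a window of passages,
    it returns the index (within the window) of the first passage where
    the content begins to shift, or len(window) if it sees no shift.
    """
    chunks: List[List[str]] = []
    start = 0
    while start < len(passages):
        # Grow a window of sequential passages until the word budget is reached.
        end, words = start, 0
        while end < len(passages) and words < max_words:
            words += len(passages[end].split())
            end += 1
        window = passages[start:end]
        if len(window) == 1:
            split = 1
        else:
            # Clamp the answer so every chunk holds at least one passage.
            split = max(1, min(find_shift(window), len(window)))
        chunks.append(list(window[:split]))
        start += split  # the next window begins at the shift point
    return chunks


def toy_find_shift(window: Sequence[str]) -> int:
    """Toy stand-in for the LLM: a shift is a change in the 'topic:' prefix."""
    first_topic = window[0].split(":")[0]
    for i, passage in enumerate(window):
        if passage.split(":")[0] != first_topic:
            return i
    return len(window)


passages = ["sea: a", "sea: b", "farm: c", "farm: d", "city: e"]
chunks = chunk_by_content_shift(passages, toy_find_shift, max_words=100)
```

In a real pipeline find_shift would wrap a prompt that lists the windowed passages with identifiers and asks the model which one starts new content. Injecting it as a function keeps the loop testable without any model access.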
Taxonomy
Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship · Topic Modeling
Methods: Attention Is All You Need · Weight Decay · WordPiece · Softmax · Layer Normalization · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Dropout
