LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte; João Marques; Miguel Graça; Miguel Freire; Lei Li; Arlindo L. Oliveira

arXiv:2406.17526·cs.CL·June 26, 2024

TL;DR

LumberChunker is a document segmentation method that uses an LLM to dynamically identify content shifts in long narratives. On the new GutenQA benchmark it outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20).

Contribution

The paper introduces LumberChunker, an LLM-based dynamic segmentation technique for long documents, and presents GutenQA, a new benchmark for evaluating retrieval in narrative texts.

Findings

01

LumberChunker outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20).

02

It is more effective than other chunking methods in RAG pipelines.

03

The gains are measured on GutenQA, a new benchmark of 3000 question-answer pairs drawn from 100 public domain narrative books.
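
The abstract reports retrieval performance as DCG@20. As a rough illustration of what that metric rewards, the sketch below computes the standard discounted cumulative gain over the top k retrieved passages. The function name `dcg_at_k` and the binary relevance labels (1 if a retrieved passage contains the answer, 0 otherwise) are assumptions made for this example; the paper's exact relevance definition may differ.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 20) -> float:
    """Standard DCG: sum of rel_i / log2(i + 1) over the top-k ranks (1-based).

    `relevances` lists the relevance of each retrieved passage in ranked order.
    Binary labels are an assumption of this sketch, not taken from the paper.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


# A relevant passage at rank 1 scores higher than the same passage at rank 2.
top_hit = dcg_at_k([1, 0, 0])
second_hit = dcg_at_k([0, 1, 0])
```

Because the discount grows with rank, a chunking method that lets the retriever surface the answer-bearing passage earlier receives a higher score, and a passage ranked below position k contributes nothing.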

Abstract

Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves…
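
The iterative procedure in the abstract can be sketched as a simple loop: take a group of sequential passages, ask for the point where the content begins to shift, close the chunk there, and continue from that point. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The names `segment`, `find_shift`, and `window_size` are hypothetical, the group is bounded by a paragraph count chosen only for illustration, and the LLM call is replaced by a pluggable callable so the example runs without any model.

```python
from typing import Callable, List, Sequence


def segment(
    paragraphs: Sequence[str],
    find_shift: Callable[[List[str]], int],
    window_size: int = 5,
) -> List[List[str]]:
    """Group consecutive paragraphs into variable-size chunks.

    `find_shift` receives a window of consecutive paragraphs and returns the
    index (within the window) of the first paragraph where the content begins
    to shift. In the paper this decision is made by prompting an LLM; here it
    is any callable (an assumption of this sketch).
    """
    chunks: List[List[str]] = []
    start = 0
    while start < len(paragraphs):
        window = list(paragraphs[start:start + window_size])
        if len(window) == 1:
            # A single leftover paragraph forms the final chunk.
            chunks.append(window)
            break
        shift = find_shift(window)
        # Clamp so each chunk holds at least one paragraph and stays in the window.
        shift = max(1, min(shift, len(window)))
        chunks.append(window[:shift])
        start += shift  # the next window begins at the shift point
    return chunks


def toy_find_shift(window: List[str]) -> int:
    """Stand-in for the LLM: a shift occurs where the 'topic:' prefix changes."""
    topic = window[0].split(":")[0]
    for i, paragraph in enumerate(window):
        if paragraph.split(":")[0] != topic:
            return i
    return len(window)  # no shift found: keep the whole window together


paragraphs = [
    "sea: The ship left at dawn.",
    "sea: The wind rose by noon.",
    "port: The harbor was crowded.",
    "port: Merchants shouted prices.",
    "port: A bell rang twice.",
    "inn: The fire was already lit.",
]
chunks = segment(paragraphs, toy_find_shift, window_size=4)
chunk_sizes = [len(chunk) for chunk in chunks]
```

Swapping `toy_find_shift` for a function that prompts a model and parses the returned passage index would give the LLM-driven behavior the abstract describes; how the prompt is worded and how the group size is bounded are left open here.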

Code & Models

Repositories

joaodsmarques/lumberchunker (Official)

Videos

LumberChunker: Long-Form Narrative Document Segmentation · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship · Topic Modeling

Methods: Attention Is All You Need · Weight Decay · WordPiece · Softmax · Layer Normalization · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Dropout