Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents
Samuel Taiwo, Mohd Amaluddin Yusoff

TL;DR
This study compares four document chunking strategies for retrieval-augmented generation over oil and gas enterprise documents. It finds that structure-aware chunking improves retrieval effectiveness and lowers computational cost, but that all four methods struggle with visually encoded diagrams, highlighting the need for multimodal approaches.
Contribution
It provides an empirical evaluation of chunking strategies in a specialized domain, emphasizing the importance of structure-aware methods and identifying limitations with visual documents.
Findings
Structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics.
Structure-aware chunking incurs lower computational costs than semantic or baseline strategies.
All methods perform poorly on visually encoded diagrams.
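The page does not say which top-K metrics the paper reports. As a hedged illustration of what such metrics usually measure, the sketch below computes recall@K and reciprocal rank for a single query. The function names and the chunk IDs are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant chunks that appear among the top-k retrieved chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no ground truth for this query; treated as zero by assumption
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval result for one query, best match first.
ranked = ["c3", "c1", "c7"]
relevant = {"c1", "c9"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5
print(reciprocal_rank(ranked, relevant))   # 0.5
```

Averaging these per-query values over an evaluation set gives the corpus-level figures that are typically compared across chunking strategies.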
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet its effectiveness fundamentally hinges on document chunking, an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited…
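To make the first of the four strategies concrete, here is a minimal sketch of fixed-size sliding-window chunking. It is an illustration under stated assumptions, not the authors' implementation: the function name, the whitespace tokenization, and the `size` and `overlap` values are all hypothetical, and the abstract does not state the window parameters used in the study.

```python
from typing import List


def sliding_window_chunks(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most `size` words, each sharing `overlap` words
    with the previous chunk. Whitespace tokenization is an assumption made here."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the document
    return chunks


# Ten-word toy document, windows of 4 words overlapping by 2.
doc = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
for chunk in sliding_window_chunks(doc, size=4, overlap=2):
    print(chunk)
```

The window ignores headings, tables, and section boundaries entirely, which is the property that structure-aware chunking is meant to address.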
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior
