Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

TL;DR
This paper systematically evaluates PDF parsing and chunking strategies within RAG systems for financial question answering, providing practical guidelines for improving PDF understanding robustness.
Contribution
It offers a comprehensive study of how different PDF parsers and chunking methods impact RAG performance in financial QA tasks, including a new benchmark dataset.
Findings
Certain parser and chunking combinations significantly improve answer accuracy.
Overlapping chunks help preserve document structure and enhance retrieval quality.
Guidelines for building more robust PDF-based RAG pipelines are proposed.
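As a rough illustration of the overlapping-chunk finding above, the following is a minimal sketch of fixed-size chunking with overlap. It is not the authors' implementation: the function name `chunk_with_overlap`, the whitespace tokenization, and the default `chunk_size` and `overlap` values are all assumptions chosen for illustration only.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens.

    Hypothetical helper for illustration; the sizes are arbitrary defaults,
    not values reported in the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Usage: naive whitespace tokenization of text extracted from a PDF page.
text = "Net revenue rose in the third quarter while operating costs fell"
pieces = chunk_with_overlap(text.split(), chunk_size=4, overlap=2)
```

With an overlap, a sentence or table row that straddles a chunk boundary still appears whole in at least one chunk whenever it is no longer than the overlap plus one token, which is one plausible reason overlap could help retrieval.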
Abstract
PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems, to automate PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we present such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF…
