From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

José Guilherme Marques dos Santos; Ricardo Yang; Rui Humberto Pereira; Alexandre Sousa; Brígida Mónica Faria; Henrique Lopes Cardoso; José Duarte; José Luís Reis; Luís Paulo Reis; Pedro Pimenta; José Paulo Marques dos Santos

arXiv:2604.04948·cs.IR·April 8, 2026

TL;DR

This study systematically evaluates how different PDF processing frameworks and configurations affect the accuracy of domain-specific question answering in RAG systems, highlighting data preparation as a key factor.

Contribution

The paper provides the first comprehensive comparison of open-source PDF-to-Markdown conversion tools in terms of their impact on RAG question-answering accuracy.

Findings

1. Docling with hierarchical splitting achieved 94.1% accuracy.

2. Metadata enrichment and hierarchy-aware chunking significantly improved results.

3. Font-based hierarchy rebuilding outperformed LLM-based approaches.
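
To make the idea of hierarchy-aware chunking with metadata enrichment concrete, here is a minimal Python sketch. It is not the paper's pipeline: the names `Chunk`, `split_by_headings`, and `enrich` are hypothetical, and the sketch assumes the converted Markdown uses ATX headings (`#`, `##`, ...). Each chunk keeps the path of headings above it, and that path is prepended to the chunk text before indexing.

```python
import re
from dataclasses import dataclass

# Matches ATX headings such as "## Fees"; group 1 is the marker, group 2 the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]  # titles of the enclosing headings, outermost first
    text: str                # body text found under the innermost heading


def split_by_headings(markdown: str) -> list[Chunk]:
    """Split Markdown at headings, recording each chunk's heading path."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk([title for _, title in path], body))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks


def enrich(chunk: Chunk) -> str:
    """Assumed form of metadata enrichment: prefix the text with its heading path."""
    return " > ".join(chunk.heading_path) + "\n\n" + chunk.text
```

For example, a paragraph under `## Fees` inside `# Regulations` would be indexed with the prefix `Regulations > Fees`. The sketch does not treat `#` lines inside fenced code blocks specially, which a production splitter would need to handle.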

Abstract

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents from PDFs, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a manually curated 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, ~492K words), with LLM-as-judge scoring averaged over 10 runs. Two baselines bounded the results: naïve PDFLoader (86.9%) and manually curated Markdown (97.1%). Docling with hierarchical…
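
The scoring protocol mentioned in the abstract (LLM-as-judge verdicts averaged over repeated runs) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: it assumes each judge verdict is a binary correct/incorrect label, and the function names `run_accuracy` and `mean_accuracy` are hypothetical.

```python
from statistics import mean


def run_accuracy(verdicts: list[bool]) -> float:
    """Fraction of questions judged correct in a single run."""
    return sum(verdicts) / len(verdicts)


def mean_accuracy(runs: list[list[bool]]) -> float:
    """Average the per-run accuracy over all runs."""
    return mean(run_accuracy(run) for run in runs)
```

In the setting described above, `runs` would hold 10 lists of 50 verdicts each, one list per run of the benchmark.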
