Lightweight and Production-Ready PDF Visual Element Parsing
Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li

TL;DR
This paper introduces a lightweight PDF parsing framework that accurately detects visual elements and associates captions, significantly improving downstream document understanding tasks while being suitable for production deployment.
Contribution
The authors develop a novel PDF parsing system combining heuristics, layout analysis, and semantic similarity, achieving high accuracy and efficiency in production environments.
Findings
Achieves ≥96% visual element detection accuracy
Attains 93% caption association accuracy
Outperforms state-of-the-art parsers and models in retrieval and QA tasks
Abstract
PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-level PDF parsing framework that accurately detects visual elements and associates captions with them using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves ≥96% visual element detection accuracy and 93% caption association accuracy. When used as a preprocessing…
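The abstract names the ingredients of caption association (spatial heuristics, layout analysis, semantic similarity) but not how they are combined. The sketch below is one illustrative way such signals could be scored together, not the authors' method. Everything in it is an assumption: the `Box`, `Element`, and `Caption` types, the weights, the `max_gap` and `threshold` values, and above all the keyword-prefix check, which is only a crude stand-in for a real semantic similarity model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; y is assumed to grow downward on the page."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Element:
    kind: str  # hypothetical labels: "figure" or "table"
    box: Box


@dataclass(frozen=True)
class Caption:
    text: str
    box: Box


# Hypothetical caption prefixes per element kind (stand-in for semantic similarity).
PREFIXES = {"figure": ("figure", "fig."), "table": ("table",)}


def horizontal_overlap(a: Box, b: Box) -> float:
    """Fraction of the narrower box's width that the two boxes share."""
    shared = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    return max(0.0, shared) / narrower if narrower > 0 else 0.0


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    if a.y1 <= b.y0:
        return b.y0 - a.y1
    if b.y1 <= a.y0:
        return a.y0 - b.y1
    return 0.0


def score(el: Element, cap: Caption, max_gap: float = 50.0) -> float:
    """Combine proximity, alignment, and keyword match into a score in [0, 1]."""
    gap = vertical_gap(el.box, cap.box)
    if gap > max_gap:  # too far apart to be a plausible caption
        return 0.0
    proximity = 1.0 - gap / max_gap
    overlap = horizontal_overlap(el.box, cap.box)
    keyword = 1.0 if cap.text.lower().startswith(PREFIXES.get(el.kind, ())) else 0.0
    return 0.4 * proximity + 0.3 * overlap + 0.3 * keyword  # assumed weights


def associate(elements, captions, threshold: float = 0.5) -> dict:
    """Greedy one-to-one matching; returns {element index: caption index}."""
    ranked = sorted(
        ((score(e, c), i, j)
         for i, e in enumerate(elements)
         for j, c in enumerate(captions)),
        reverse=True,
    )
    matched, used_captions = {}, set()
    for s, i, j in ranked:
        if s < threshold:
            break  # remaining pairs score even lower
        if i in matched or j in used_captions:
            continue
        matched[i] = j
        used_captions.add(j)
    return matched
```

Captions that sit too far from any element, such as a stray logo label near the page footer, score 0 and are left unmatched, which mirrors the abstract's concern about non-informative artifacts. A production system would presumably replace the prefix check with an embedding-based similarity and tune the weights on labeled data.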
