Robust PDF Document Conversion Using Recurrent Neural Networks
Nikolaos Livathinos (1), Cesar Berrospi (1), Maksym Lysak (1), Viktor Kuropiatnyk (1), Ahmed Nassar (1), Andre Carvalho (1), Michele Dolfi (1), Christoph Auer (1), Kasper Dinkla (1), Peter Staar (1) ((1) IBM Research)

TL;DR
This paper introduces a recurrent neural network approach for detailed and efficient PDF document structure recovery directly from low-level PDF data, outperforming visual methods in accuracy, cost, and scalability.
Contribution
It presents a novel RNN-based method that classifies PDF printing commands for structure detection, enabling finer labels and better cross-page text flow handling.
Findings
Achieved a weighted F1 score of 97% across 17 structural labels.
Reduced memory and computational requirements compared to visual methods.
Deployed successfully in production for large-scale COVID-19 PDF analysis.
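For readers unfamiliar with the metric behind the first finding, a weighted F1 score averages the per-label F1 scores, weighting each label by how often it occurs in the ground truth. The sketch below is a minimal illustration in plain Python; the function name `weighted_f1` and the label names are hypothetical and are not taken from the paper's evaluation code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-label F1 scores (illustrative helper)."""
    support = Counter(y_true)          # number of true instances per label
    total = len(y_true)
    score = 0.0
    for label in support:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score

# Hypothetical labels for four PDF printing commands.
y_true = ["title", "paragraph", "paragraph", "caption"]
y_pred = ["title", "paragraph", "caption", "caption"]
print(weighted_f1(y_true, y_pred))
```

Because frequent labels carry more weight, a high weighted F1 can coexist with weaker performance on rare labels, which is worth keeping in mind when reading a single aggregate figure.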
Abstract
The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document…
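The core idea in the abstract, treating a page as a sequence of printing commands and assigning one structural label per command, can be sketched as a sequence-labelling forward pass. The code below is only an illustration under stated assumptions: it uses a plain Elman-style recurrence with untrained random weights rather than the authors' architecture, and the feature set (`x`, `y`, `font_size`, `text`), the US Letter page size, the three-label set, and every name in it are hypothetical.

```python
import math
import random

LABELS = ["title", "paragraph", "caption"]  # hypothetical subset of labels

def featurize(cmd):
    # Assumed features: position normalised by a US Letter page (612 x 792 pt),
    # scaled font size, and capped text length.
    return [cmd["x"] / 612.0, cmd["y"] / 792.0,
            cmd["font_size"] / 24.0, min(len(cmd["text"]), 100) / 100.0]

def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

class TinyRNNTagger:
    """Elman-style RNN emitting one label distribution per input step."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w_in = init_matrix(n_hidden, n_in, rng)
        self.w_rec = init_matrix(n_hidden, n_hidden, rng)
        self.w_out = init_matrix(n_out, n_hidden, rng)
        self.n_hidden = n_hidden

    def forward(self, sequence):
        h = [0.0] * self.n_hidden      # hidden state carried across commands
        outputs = []
        for x in sequence:
            a = matvec(self.w_in, x)
            b = matvec(self.w_rec, h)
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
            outputs.append(softmax(matvec(self.w_out, h)))
        return outputs

def tag_commands(model, commands):
    probs = model.forward([featurize(c) for c in commands])
    return [LABELS[max(range(len(p)), key=p.__getitem__)] for p in probs]

# Hypothetical printing commands in page order.
commands = [
    {"x": 72, "y": 720, "font_size": 18, "text": "Robust PDF Document Conversion"},
    {"x": 72, "y": 680, "font_size": 10, "text": "The number of published PDF documents"},
    {"x": 72, "y": 300, "font_size": 8, "text": "Figure 1: Overview"},
]
model = TinyRNNTagger(n_in=4, n_hidden=8, n_out=len(LABELS))
tags = tag_commands(model, commands)   # weights are untrained, so labels are arbitrary
```

The point of the recurrence is that the hidden state lets the label for one command depend on the commands printed before it, which is what allows a sequence model to follow text flow rather than classify each command in isolation.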
