Reengineering PDF-Based Documents Targeting Complex Software Specifications
Mehrdad Nojoumian, Timothy C. Lethbridge

TL;DR
This paper presents a method to reengineer complex PDF documents, like OMG specifications, into more usable, structured, and navigable formats to improve user experience and facilitate understanding.
Contribution
It introduces a general approach for extracting logical structures from complex PDFs and creating multilayer hypertext versions, applicable beyond just OMG specifications.
Findings
Successfully extracted logical structures in XML format
Created multilayer hypertext for easier navigation
Applicable to various complex document types
Abstract
This article addresses the reengineering of complex PDF-based documents, with the specifications of the Object Management Group (OMG) as our initial targets. Our motivation is that such specifications are dense, difficult to use, and tend to have complicated structures. Our objective is therefore to create an approach that allows us to reengineer PDF-based documents, and to illustrate how to make more usable versions of electronic documents (such as specifications, technical books, etc.) so that end users have a better experience with them. The first step was to extract the logical structure of the document in a meaningful XML format for subsequent processing. Our initial assumption was that many key concepts of a document are expressed in this structure. In the next phase, we created a multilayer hypertext version of the document to facilitate browsing and navigating. Although we…
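To make the first step more concrete, here is a minimal sketch of how a flat list of headings could be turned into a nested XML logical structure. It is an illustration only, not the authors' implementation: the function name `outline_to_xml`, the `(level, title)` input format, the element and attribute names, and the sample headings are all assumptions made for this example. It assumes the heading levels have already been recovered from the PDF (for instance from its bookmarks) and that levels start at 1.

```python
import xml.etree.ElementTree as ET


def outline_to_xml(entries, doc_title="document"):
    """Build a nested XML tree from (level, title) pairs.

    Hypothetical helper: `entries` is assumed to be an ordered list of
    headings with 1-based nesting levels, e.g. taken from PDF bookmarks.
    """
    root = ET.Element("document", {"title": doc_title})
    # Stack of (level, element); the root sits at level 0 and is never popped.
    stack = [(0, root)]
    for level, title in entries:
        # Close sections that are at the same depth or deeper than this one.
        while stack[-1][0] >= level:
            stack.pop()
        node = ET.SubElement(
            stack[-1][1], "section", {"level": str(level), "title": title}
        )
        stack.append((level, node))
    return root


# Made-up headings, loosely in the style of a software specification.
entries = [
    (1, "Scope"),
    (1, "Classes"),
    (2, "Class"),
    (2, "Association"),
    (1, "Semantics"),
]
root = outline_to_xml(entries, "Example Spec")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A tree of this kind could then be walked to generate one hypertext page per section, with links between parent and child sections, which is one plausible reading of the multilayer hypertext described above.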
