TL;DR
This paper presents a novel data-mining pipeline that combines layout analysis and OCR to extract structured, geocoded spatio-temporal data from printed registries, enabling detailed analysis of historical industrialization.
Contribution
The authors develop an integrated method for digitizing and structuring printed socioenvironmental data, facilitating new insights into historical industrial land use patterns.
Findings
Manufacturing dispersed from Providence's urban core along I-95.
High-resolution spatio-temporal data enables detailed socioenvironmental analysis.
Method successfully extracts structured data from scanned printed directories.
Abstract
Despite the growing availability of big data in many fields, historical data on socioenvironmental phenomena are often not available due to a lack of automated and scalable approaches for collecting, digitizing, and assembling them. We have developed a data-mining method for extracting tabulated, geocoded data from printed directories. While scanning and optical character recognition (OCR) can digitize printed text, these methods alone do not capture the structure of the underlying data. Our pipeline integrates both page layout analysis and OCR to extract tabular, geocoded data from structured text. We demonstrate the utility of this method by applying it to scanned manufacturing registries from Rhode Island that record 41 years of industrial land use. The resulting spatio-temporal data can be used for socioenvironmental analyses of industrialization at a resolution that was not…
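To illustrate why OCR output alone does not capture table structure, the following is a minimal sketch of one way to turn positioned OCR words into rows and columns. It is not the authors' pipeline: the `Word` type, the function names, the pixel tolerance, the column boundaries, and the sample entries are all hypothetical, and geocoding is left out.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with its position on the page (hypothetical type)."""
    text: str
    x: float  # left edge of the word box, in pixels
    y: float  # vertical centre of the word box, in pixels


def group_rows(words, y_tol=5.0):
    """Group words into text lines by vertical proximity."""
    rows = []
    for w in sorted(words, key=lambda w: w.y):
        # Start a new row when the vertical gap exceeds the tolerance.
        if rows and abs(w.y - rows[-1][-1].y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def to_record(row, column_starts):
    """Assign each word to the last column whose start is <= the word's x."""
    cells = [[] for _ in column_starts]
    for w in sorted(row, key=lambda w: w.x):
        idx = 0
        for i, start in enumerate(column_starts):
            if w.x >= start:
                idx = i
        cells[idx].append(w.text)
    return [" ".join(cell) for cell in cells]


def extract_table(words, column_starts, y_tol=5.0):
    """Return one list of cell strings per detected row."""
    return [to_record(r, column_starts) for r in group_rows(words, y_tol)]


# Made-up directory entries: name, street address, town.
SAMPLE_WORDS = [
    Word("Acme", 10, 100), Word("Mills", 60, 101),
    Word("12", 200, 100), Word("Main", 230, 99), Word("St", 280, 100),
    Word("Providence", 400, 100),
    Word("Bay", 10, 130), Word("Foundry", 50, 131),
    Word("4", 200, 130), Word("Dock", 220, 130), Word("Rd", 270, 129),
    Word("Warwick", 400, 130),
]

if __name__ == "__main__":
    for record in extract_table(SAMPLE_WORDS, column_starts=[0, 200, 400]):
        print(record)
```

In this sketch the column boundaries are supplied by hand; a layout-analysis step would be expected to estimate them from the page itself, and the address column would then be passed to a geocoder.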
