Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning

Mihir Bhaskar; Jun Tao Luo; Zihan Geng; Asmita Hajra; Junia Howell; Matthew R. Gormley

arXiv:2505.24676 · cs.LG · June 2, 2025

TL;DR

This paper develops a cost-effective method combining OCR, computer vision, and regression models to digitize and analyze historical property appraisal records, enabling better understanding of past housing policies' impacts.

Contribution

It introduces a novel two-stage OCR approach and a regression model for estimating historical property values, facilitating large-scale analysis of inaccessible physical records.
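The two-stage idea (classical computer vision to find the region of interest, then a learned OCR model to read it) can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `locate_ink`, `crop`, `read_field`, and `fake_ocr` are hypothetical, the image is a plain list of grayscale rows, and the OCR model is replaced by a stub so the example runs without any external library.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # grayscale rows: 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # top, left, bottom, right (bottom/right exclusive)


def locate_ink(image: Image, threshold: int = 128) -> Optional[Box]:
    """Stage 1 (classical, assumed): bounding box of pixels darker than threshold."""
    rows = [r for r, row in enumerate(image) if any(p < threshold for p in row)]
    cols = [c for row in image for c, p in enumerate(row) if p < threshold]
    if not rows:
        return None  # blank region: nothing to hand to the OCR stage
    return rows[0], min(cols), rows[-1] + 1, max(cols) + 1


def crop(image: Image, box: Box) -> Image:
    """Cut the located region out of the full scan."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def read_field(image: Image, recognize: Callable[[Image], str]) -> Optional[str]:
    """Stage 2: pass the cropped region to an OCR model supplied by the caller."""
    box = locate_ink(image)
    if box is None:
        return None  # the caller needs a fallback when OCR cannot be applied
    return recognize(crop(image, box))


# Usage with a stand-in for a real OCR model (hypothetical stub).
def fake_ocr(region: Image) -> str:
    return f"{len(region)}x{len(region[0])}"


page = [[255] * 6 for _ in range(5)]  # 5x6 white page
page[1][2] = 0                        # two dark pixels standing in for handwriting
page[2][3] = 0
print(read_field(page, fake_ocr))     # prints "2x2"
```

Keeping the recognizer as a parameter means the localization step can be tested on its own, and a real OCR model can be swapped in without changing the first stage.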

Findings

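01 Manually annotated property cards for over 12,000 properties to train and validate the methods

02 Used a two-stage approach, combining classical computer vision with deep learning-based OCR, to label an additional 50,000 properties

03 Demonstrated the regression model's ability to estimate values across counties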

Abstract

Despite well-documented consequences of the U.S. government's 1930s housing policies on racial wealth disparities, scholars have struggled to quantify their precise financial effects due to the inaccessibility of historical property appraisal records. Many counties still store these records in physical formats, making large-scale quantitative analysis difficult. We present an approach scholars can use to digitize historical housing assessment data, applying it to build and release a dataset for one county. Starting from publicly available scanned documents, we manually annotated property cards for over 12,000 properties to train and validate our methods. We use OCR to label data for an additional 50,000 properties, based on our two-stage approach combining classical computer vision techniques with deep learning-based OCR. For cases where OCR cannot be applied, such as when scanned…

Code & Models

Repositories

juntaoluo/erukaexp
Official

Datasets

eruka-cmu-housing/historical-appraisals-ocr-ml
dataset · 97 downloads