Leveraging Semantic Segmentation Masks with Embeddings for Fine-Grained Form Classification
Taylor Archibald, Tony Martinez

TL;DR
This paper introduces a novel approach combining semantic segmentation with deep learning embeddings to improve fine-grained, unsupervised classification of historical document forms, demonstrating significant accuracy improvements.
Contribution
To the authors' knowledge, this is the first work to evaluate embeddings on fine-grained, unsupervised form classification. It also proposes semantic segmentation as a preprocessing step to improve embedding quality.
Findings
Semantic segmentation improves clustering accuracy.
Embeddings effectively distinguish similar document types.
Proposed method outperforms baseline approaches.
Abstract
Efficient categorization of historical documents is crucial for fields such as genealogy, legal research, and historical scholarship, where manual classification is impractical for large collections due to its labor-intensive and error-prone nature. To address this, we propose a representational learning strategy that integrates semantic segmentation and deep learning models such as ResNet, CLIP, Document Image Transformer (DiT), and masked auto-encoders (MAE) to generate embeddings that capture document features without predefined labels. To the best of our knowledge, we are the first to evaluate embeddings on fine-grained, unsupervised form classification. To improve these embeddings, we propose to first employ semantic segmentation as a preprocessing step. We contribute two novel datasets, the French 19th-century and U.S. 1950 Census records, to…
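The pipeline the abstract outlines (mask each page with a segmentation output, embed it, then group the embeddings without labels) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `apply_mask`, `embed`, and `kmeans` are illustrative names, the embedding is a toy row/column ink profile rather than ResNet, CLIP, DiT, or MAE, and the mask is assumed to mark the pixels of the pre-printed form so that handwriting is zeroed out.

```python
from math import dist

def apply_mask(image, mask):
    """Keep pixels where the (assumed) segmentation mask is 1; zero out the rest."""
    return [[p if m else 0 for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

def embed(image):
    """Toy embedding: row sums followed by column sums (stand-in for a deep encoder)."""
    rows = [float(sum(r)) for r in image]
    cols = [float(sum(c)) for c in zip(*image)]
    return rows + cols

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def add_ink(template, pixels):
    """Copy a form template and add simulated handwriting at the given (row, col) pixels."""
    page = [row[:] for row in template]
    for r, c in pixels:
        page[r][c] = 1
    return page

# Two hypothetical form layouts: a header line (A) and a left margin rule (B).
template_a = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
template_b = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

# Each document is (page image, mask). The mask here is simply the template,
# standing in for what a segmentation model would predict.
docs = [
    (add_ink(template_a, [(2, 1), (2, 2)]), template_a),
    (add_ink(template_a, [(3, 3)]), template_a),
    (add_ink(template_b, [(2, 1), (2, 2)]), template_b),
    (add_ink(template_b, [(1, 3)]), template_b),
]

embeddings = [embed(apply_mask(page, mask)) for page, mask in docs]
labels = kmeans(embeddings, k=2)
print(labels)
```

In this toy setup the masking step removes the handwriting that two different forms happen to share, so pages of the same layout map to identical embeddings. Real pages would need a learned encoder and a predicted mask, and the separation would be far less clean.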
Taxonomy
Topics: Image Processing and 3D Reconstruction · 3D Surveying and Cultural Heritage · Handwritten Text Recognition Techniques
Methods: Attention Is All You Need · Kaiming Initialization · Max Pooling · Average Pooling · Global Average Pooling · Linear Layer · Position-Wise Feed-Forward Layer · Convolution · Multi-Head Attention · Residual Connection
