Unsupervised Document and Template Clustering using Multimodal Embeddings
Phillipe R. Sampaio, Helene Maxcici

TL;DR
This paper presents a versatile, model-agnostic approach for unsupervised document clustering using multimodal embeddings, evaluated across diverse document types and conditions, highlighting modality-specific strengths and robustness trade-offs.
Contribution
It introduces a systematic pipeline for multimodal document clustering with comprehensive evaluation and a reproducible tuning protocol, advancing unsupervised document organization methods.
Findings
Vision features excel on clean pages for template discovery.
Text features are more robust under covariate shift.
Fused multimodal encoders provide the best balance of accuracy and robustness.
Abstract
We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + k-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with k-Means, DBSCAN, HDBSCAN + k-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template…
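The k-NN assignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a density-based clusterer such as HDBSCAN has already labeled each document vector, with -1 marking unassigned (noise) points, and it reassigns each noise point to the majority label among its k nearest clustered neighbors. The function name `assign_noise_by_knn`, the Euclidean distance, the majority vote, the value k=3, and the toy vectors are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter

NOISE = -1  # label that density-based clusterers such as HDBSCAN give to unassigned points


def assign_noise_by_knn(vectors, labels, k=3):
    """Return a copy of `labels` in which every noise point takes the majority
    label of its k nearest already-clustered neighbours (Euclidean distance).

    Hypothetical helper: metric, k, and voting rule are illustrative choices.
    """
    clustered = [i for i, lab in enumerate(labels) if lab != NOISE]
    if not clustered:
        return list(labels)  # no clustered points to vote with; leave unchanged
    resolved = list(labels)
    for i, lab in enumerate(labels):
        if lab != NOISE:
            continue
        # Rank clustered points by distance to the noise point and keep the k closest.
        nearest = sorted(clustered, key=lambda j: math.dist(vectors[i], vectors[j]))[:k]
        votes = Counter(labels[j] for j in nearest)
        resolved[i] = votes.most_common(1)[0][0]
    return resolved


# Toy 2-D "document vectors": two tight groups plus two points left as noise.
doc_vectors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster 0
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # cluster 1
    (0.2, 0.2), (4.8, 4.9),               # noise points
]
raw_labels = [0, 0, 0, 1, 1, 1, NOISE, NOISE]

final_labels = assign_noise_by_knn(doc_vectors, raw_labels, k=3)
print(final_labels)  # [0, 0, 0, 1, 1, 1, 0, 1]
```

After this step every document carries a cluster label, which lets density-based results be scored against centroid-based ones on the full corpus rather than only on the points the clusterer kept.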
Taxonomy
Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Advanced Text Analysis Techniques
