VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal

TL;DR
VERSE is a novel methodology that explores visual embedding spaces to identify problematic regions, generate synthetic data, and improve the performance of vision-language models in visually-rich document understanding tasks.
Contribution
It introduces a clustering-guided approach for analyzing and enhancing visual embeddings, enabling targeted data augmentation and performance improvements.
Findings
VERSE uncovers visual features linked to errors in document understanding.
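Retraining with synthetic samples that contain these features substantially improves F1 without degrading generalization.
On-premise models such as Donut and Idefics2, when optimized with VERSE, match or outperform SaaS solutions.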
Abstract
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with…
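The workflow the abstract outlines (reduce the visual embeddings, cluster them, then flag the clusters where the model scores poorly) can be illustrated with a minimal sketch. The abstract does not say which reduction or clustering algorithms VERSE uses, so this example substitutes plain PCA and k-means written in NumPy. The function names, the per-sample score array, and the 0.7 threshold are all hypothetical and chosen only for illustration.

```python
import numpy as np

def pca_reduce(embeddings, n_components=2):
    """Project embeddings onto their top principal components (assumed reducer)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means (assumed clusterer); returns one label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cluster_scores(labels, sample_scores):
    """Mean per-sample score (e.g. F1) inside each cluster."""
    return {int(j): float(sample_scores[labels == j].mean())
            for j in np.unique(labels)}

def problematic_clusters(scores, threshold):
    """Clusters whose mean score falls below a hypothetical threshold."""
    return sorted(j for j, s in scores.items() if s < threshold)

# Synthetic stand-in data: two groups of 16-dimensional "embeddings",
# one of which the model is assumed to handle poorly.
rng = np.random.default_rng(42)
easy = rng.normal(0.0, 0.1, size=(50, 16))
hard = rng.normal(10.0, 0.1, size=(50, 16))
embeddings = np.vstack([easy, hard])
scores = np.concatenate([np.full(50, 0.95), np.full(50, 0.40)])

reduced = pca_reduce(embeddings)
labels = kmeans(reduced, k=2)
per_cluster = cluster_scores(labels, scores)
flagged = problematic_clusters(per_cluster, threshold=0.7)
print(per_cluster, flagged)
```

In a real setting, the samples in the flagged clusters would be inspected for shared visual features, and synthetic training samples with those features would then be generated for retraining.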
Taxonomy
Topics: Handwritten Text Recognition Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
