VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Ignacio de Rodrigo; Alvaro J. Lopez-Lopez; Jaime Boal

arXiv:2601.05125·cs.CV·January 9, 2026

TL;DR

VERSE is a novel methodology that explores visual embedding spaces to identify problematic regions, generate synthetic data, and improve the performance of vision-language models in visually-rich document understanding tasks.

Contribution

It introduces a clustering-guided approach for analyzing and enhancing visual embeddings, enabling targeted data augmentation and performance improvements.

Findings

01. VERSE uncovers visual features linked to errors in document understanding.
02. Retraining with synthetic data improves F1 scores significantly.
03. On-premise models with VERSE match or outperform SaaS solutions.

Abstract

This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with…
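The workflow the abstract describes (project visual embeddings into a low-dimensional space, cluster them, and flag the clusters where the model performs worst) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a reduction or clustering algorithm, so PCA and k-means are stand-ins chosen for brevity, and the function names `pca_reduce`, `kmeans`, and `rank_clusters_by_score`, as well as the assumption that a per-sample F1 score is available, are hypothetical.

```python
import numpy as np


def pca_reduce(embeddings, n_components=2):
    """Project embeddings onto their top principal components (stand-in reducer)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def _assign(points, centers):
    """Return the index of the nearest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (stand-in clustering step); returns cluster labels."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centers does not modify points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = _assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return _assign(points, centers)


def rank_clusters_by_score(labels, scores):
    """Return (cluster_id, mean_score) pairs, worst-performing cluster first."""
    means = {int(c): float(scores[labels == c].mean()) for c in np.unique(labels)}
    return sorted(means.items(), key=lambda item: item[1])
```

Under these assumptions, `rank_clusters_by_score(kmeans(pca_reduce(embeddings), k), f1_per_sample)` would list the lowest-scoring clusters first. Those are the regions whose samples one would inspect for shared visual features and then target with additional synthetic training data.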

Code & Models

Datasets

de-Rodrigo/merit (dataset, 4.5k downloads)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis