DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

TL;DR
DAVE is a specialized vision encoder for document understanding and web agents, trained with self-supervised and supervised methods, combining multiple encoders to improve performance on diverse tasks.
Contribution
We introduce DAVE, a novel vision encoder designed specifically for document and web tasks, utilizing a hybrid training pipeline and encoder merging strategies.
Findings
DAVE outperforms existing encoders on document and web benchmarks.
The model-merging scheme enhances compatibility across web agent architectures.
Ensemble training improves feature robustness and task performance.
Abstract
While vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual…
Peer Reviews
Decision: ICLR 2026 Poster
The breadth of evaluations and ablations is a key strength. [1] Thorough benchmarking on document understanding, general VQA, and web understanding benchmarks: the evaluation covers a wide variety of benchmarks and baseline models. Using both Qwen-2.5-7B-Instruct and Llama-3.2-3B-Instruct decoders further demonstrates alignment with different text decoders. [2] Very strong improvements on Mind2Web (Table 2). [3] Clean ablation setups are useful to justify the training setup. Table 4 c,d are esp
[1] Average performance is used across all results, which might under- or overestimate the performance. Reporting some best-of-N results using confidence or majority voting would help readers understand the variance of the model's performance. [2] There may be domain bias, for example overfitting to synthetic chart styles in PlotQA or financial-domain layouts in FinTabNet. It would have been useful to detect and discuss such biases.
1. The paper proposes a modification to the standard MAE objective by reconstructing raw pixel values directly, rather than normalized pixel values. This aims to stabilize training, particularly for document and web images which exhibit low inter-patch variance. While further empirical evidence directly linking this change to stability would strengthen the claim, the approach itself is a thoughtful adaptation to domain-specific data characteristics. 2. A significant strength is the introduction
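The difference between the two reconstruction targets mentioned in this review can be illustrated with a small sketch. It assumes the per-patch normalization used by the standard MAE objective (subtract the patch mean, divide by the patch standard deviation) and uses hypothetical helper names `normalize_patch` and `masked_mse`; it is not the authors' implementation, and the toy patch values are made up for illustration.

```python
import math


def normalize_patch(patch, eps=1e-6):
    """Per-patch normalization as in the standard MAE target (assumed form)."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / math.sqrt(var + eps) for p in patch]


def masked_mse(pred, target, mask, normalize_target=False):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a list of pixel values.
    mask: list of booleans, True where the patch was masked out.
    normalize_target: False reconstructs raw pixels, True reconstructs
    per-patch normalized pixels.
    """
    total, n_masked = 0.0, 0
    for pred_patch, tgt_patch, is_masked in zip(pred, target, mask):
        if not is_masked:
            continue
        if normalize_target:
            tgt_patch = normalize_patch(tgt_patch)
        sq_err = sum((a - b) ** 2 for a, b in zip(pred_patch, tgt_patch))
        total += sq_err / len(tgt_patch)
        n_masked += 1
    return total / n_masked if n_masked else 0.0


# A nearly blank patch, e.g. page background with faint noise (toy values).
blank = [1.0, 1.0, 1.0, 0.99]
raw_range = max(blank) - min(blank)
norm = normalize_patch(blank)
norm_range = max(norm) - min(norm)
print(raw_range, norm_range)
```

In this toy case the normalized target spreads a barely visible pixel difference over a much larger range than the raw target, which is one plausible reading of why raw-pixel reconstruction might behave more stably on mostly uniform document and web images.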
1. The core hypothesis of this paper is that current VLM vision encoders "lack the robust structural and spatial information essential for document understanding and web agents." However, this assertion is made without references or experimental evidence. Given the existence of numerous VLM works for document understanding, a more detailed explanation is needed to clarify why these models specifically lack the required powerful structural and spatial information. The observed performance gai
* The paper is clearly written and easy to follow. * The research question of vision encoder for document images is important.
* The contribution of the paper is limited. The training methods employed, including self-supervised pre-training and model merging, are well-established ideas, such as MAE self-supervised learning [1] and model soups [2]. * Although the experiments encompass a wide range of benchmarks (e.g., DocVQA, ChartQA, Mind2Web), the paper provides limited analysis of cross-domain generalization (such as performance on non-English documents), as well as of robustness and interpretability. [1] Mask
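For readers unfamiliar with the model-soup idea the reviewer refers to, a minimal sketch of uniform weight averaging follows. It assumes every model shares the same architecture and parameter names, and it represents weights as plain dictionaries of float lists; the function name `uniform_soup` is hypothetical. This page does not describe the exact merging scheme used in DAVE, so this should not be read as the paper's method.

```python
def uniform_soup(state_dicts):
    """Average parameters element-wise across models with identical shapes.

    state_dicts: list of dicts mapping parameter name -> list of floats.
    Returns a new dict holding the uniform average of each parameter.
    """
    if not state_dicts:
        raise ValueError("need at least one set of weights")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("all models must share the same parameter names")
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in keys
    }


# Toy weights for two hypothetical encoders with the same architecture.
encoder_a = {"proj.weight": [1.0, 3.0], "proj.bias": [0.0]}
encoder_b = {"proj.weight": [3.0, 5.0], "proj.bias": [2.0]}
merged = uniform_soup([encoder_a, encoder_b])
print(merged)
```

Uniform averaging is the simplest variant; weighted or greedy selection schemes also exist, and nothing here indicates which one, if any, the paper uses.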
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
