Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida

TL;DR
This paper investigates the discrepancy between internal representations and generated responses in visual document understanding models. It finds that task-relevant information is often more linearly accessible in intermediate layers than in the final layer, and explores targeted fine-tuning of intermediate layers to narrow that gap.
Contribution
It uncovers a gap between internal representations and generated responses in large vision language models (LVLMs) on visual document understanding (VDU), and proposes fine-tuning strategies for intermediate layers to enhance performance.
Findings
Task-relevant information is often encoded more linearly in intermediate layers than in the final layer, and what the model encodes internally does not always surface in its generated responses.
Fine-tuning intermediate layers improves both probing accuracy and response accuracy.
Addressing the internal-response gap enhances VDU model performance.
Abstract
Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning…
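The layer-wise linear probing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the probe here is a closed-form ridge classifier on one-hot targets, the hidden states are synthetic, and every name (`fit_linear_probe`, `probe_all_layers`, `make_synthetic_hidden_states`, the `signal_per_layer` values) is a hypothetical stand-in. In practice the per-layer features would come from the LLM inside an LVLM, and the probe could just as well be logistic regression.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-2):
    """Fit a ridge-regularised linear classifier on one-hot targets (closed form)."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    y = np.eye(num_classes)[labels]                             # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)                       # shape: (dim + 1, num_classes)


def probe_accuracy(weights, features, labels):
    """Accuracy of the linear probe on held-out features."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == labels))


def probe_all_layers(hidden_by_layer, labels, num_classes, train_fraction=0.8):
    """Train one independent probe per layer and return held-out accuracies."""
    split = int(labels.shape[0] * train_fraction)
    accuracies = []
    for feats in hidden_by_layer:
        w = fit_linear_probe(feats[:split], labels[:split], num_classes)
        accuracies.append(probe_accuracy(w, feats[split:], labels[split:]))
    return accuracies


def make_synthetic_hidden_states(signal_per_layer=(0.0, 3.0, 1.0),
                                 n=400, dim=16, num_classes=3, seed=0):
    """Toy stand-in for per-layer hidden states (assumption, not real model data).

    Each layer is class-dependent signal plus unit Gaussian noise; the signal
    strength per layer is chosen arbitrarily, only to exercise the probing code.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=n)
    centers = rng.normal(size=(num_classes, dim))
    hidden = [s * centers[labels] + rng.normal(size=(n, dim))
              for s in signal_per_layer]
    return hidden, labels


if __name__ == "__main__":
    hidden, labels = make_synthetic_hidden_states()
    for layer, acc in enumerate(probe_all_layers(hidden, labels, num_classes=3)):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

Comparing these per-layer probe accuracies against the accuracy of the model's generated answers on the same examples is one way to quantify the internal-response gap; the toy numbers printed here say nothing about the paper's actual results.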
