See then Tell: Enhancing Key Information Extraction with Vision Grounding

Shuhang Liu; Zhenrong Zhang; Pengfei Hu; Jiefeng Ma; Jun Du; Qing Wang; Jianshu Zhang; Chenyu Liu

arXiv:2409.19573 · cs.CV · August 29, 2025

TL;DR

This paper introduces STNet, an end-to-end model that enhances key information extraction from visually rich documents by integrating vision grounding with text answers, significantly improving accuracy over traditional OCR-based methods.

Contribution

The paper presents STNet, a novel model utilizing a <see> token for vision grounding in key information extraction, and introduces the TVG dataset created with GPT-4 for training and evaluation.

Findings

01. Achieves state-of-the-art results on the CORD, SROIE, and DocVQA datasets.

02. Demonstrates improved accuracy in key information extraction tasks.

03. Provides a new dataset with vision grounding annotations for table QA.

Abstract

In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique <see> token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token. Positioned at the outset of the answer text, the <see> token allows the model to first see, observing the regions of the image…
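The "see then tell" output format described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `GroundedAnswer` class, the `split_grounded_answers` function, the four-number box format, and the `toy_decode_box` stand-in for the coordinate decoder are all hypothetical names and assumptions chosen for illustration. The only detail taken from the abstract is that a `<see>` token opens each answer and that a decoder turns the state linked to that token into physical coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

SEE = "<see>"  # special token named in the abstract

# Assumed box format (x0, y0, x1, y1) in pixels; the paper may use another one.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedAnswer:
    text: str  # the "tell" part: answer text that follows <see>
    box: Box   # the "see" part: region decoded from the <see> position


def split_grounded_answers(
    tokens: Sequence[str],
    hidden: Sequence[Sequence[float]],
    decode_box: Callable[[Sequence[float]], Box],
) -> List[GroundedAnswer]:
    """Group generated tokens into answers, each opened by a <see> token.

    `hidden[i]` is the (hypothetical) state produced at token i. Only the
    states at <see> positions are passed to `decode_box`. Tokens that
    appear before the first <see> are ignored.
    """
    answers: List[GroundedAnswer] = []
    current_box: Optional[Box] = None
    current_text: List[str] = []
    for tok, state in zip(tokens, hidden):
        if tok == SEE:
            # A new <see> closes the previous answer, if there was one.
            if current_box is not None:
                answers.append(GroundedAnswer(" ".join(current_text), current_box))
            current_box = decode_box(state)
            current_text = []
        elif current_box is not None:
            current_text.append(tok)
    if current_box is not None:
        answers.append(GroundedAnswer(" ".join(current_text), current_box))
    return answers


def toy_decode_box(state: Sequence[float]) -> Box:
    """Stand-in decoder: reads the first four state values as a box."""
    return (state[0], state[1], state[2], state[3])


if __name__ == "__main__":
    tokens = [SEE, "Total", "12.50", SEE, "Cash"]
    hidden = [[10, 20, 110, 40], [0] * 4, [0] * 4, [5, 50, 60, 70], [0] * 4]
    for answer in split_grounded_answers(tokens, hidden, toy_decode_box):
        print(answer)
```

Placing the box lookup at the `<see>` position, before any answer tokens are collected, mirrors the ordering the abstract describes: the region is fixed first and the text follows it.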

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Web Data Mining and Analysis

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding