TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing
Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim

TL;DR
TabFlash improves table image understanding by progressively conditioning visual features on the question, pruning background tokens, and focusing on essential tokens, achieving state-of-the-art performance at reduced computational cost.
Contribution
The paper presents TabFlash, a multimodal model that combines progressive question conditioning, background token pruning, and token focusing to make table understanding both more efficient and more effective.
Findings
Achieves state-of-the-art performance on table understanding tasks.
Uses 27% fewer FLOPs and 30% less memory than previous models.
Reduces redundancy by pruning background tokens while retaining essential information.
Abstract
Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further…
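The two ideas the abstract describes, injecting the question with gradually increasing frequency across Vision Transformer layers and discarding background tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the injection schedule or the pruning criterion, so the halving-gap schedule, the per-token relevance scores, and the names `injection_layers`, `prune_background`, `first_gap`, and `keep_ratio` are all assumptions made for illustration.

```python
def injection_layers(num_layers, first_gap=4):
    """Pick layer indices at which the question is injected.

    Assumed schedule (hypothetical, not from the paper): the gap between
    consecutive injections halves each time, so later layers receive the
    question more frequently than earlier ones.
    """
    layers = []
    gap = first_gap
    idx = gap - 1
    while idx < num_layers:
        layers.append(idx)
        gap = max(1, gap // 2)  # shrink the gap, but never below one layer
        idx += gap
    return layers


def prune_background(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens and drop the rest.

    `scores` is a hypothetical per-token relevance score; how background
    tokens are actually identified is not stated in the abstract.
    The original token order is preserved among the kept tokens.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep], keep


if __name__ == "__main__":
    # A 12-layer encoder: injections become denser toward the last layers.
    print(injection_layers(12, first_gap=4))

    # Four toy tokens: two background patches and two table-cell patches.
    tokens = ["bg0", "cell0", "bg1", "cell1"]
    scores = [0.1, 0.9, 0.2, 0.8]
    print(prune_background(tokens, scores, keep_ratio=0.5))
```

In this sketch, pruning simply shortens the token sequence passed to later stages, which is where the efficiency gain would come from. Any step that compensates for the information lost by pruning is left out, because the abstract is truncated before it describes one.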
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
