Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

Yingjie Zhu; Xuefeng Bai; Kehai Chen; Yang Xiang; Youcheng Pan; Xiaoqiang Zhou; Min Zhang

arXiv:2602.03491·cs.CV·February 4, 2026

Open Access

TL;DR

This paper introduces DiSCo, a disentangled structure-content alignment framework, and Table-GLS, a structure-guided reasoning framework built on it. Together they improve large vision-language models' ability to reason over complex table images by separating structure from content, without heavy supervision or external tools.

Contribution

The work proposes a novel disentangled alignment framework and a structure-guided reasoning method to enhance table reasoning in LVLMs with minimal annotation and no external tools.

Findings

01 Significant performance improvements on diverse table reasoning benchmarks.

02 Effective generalization to unseen table structures.

03 Reduced reliance on supervised data and external tools.

Abstract

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how can LVLMs be adapted to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling