Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding

Junyi Ye; Ankan Dash; Wenpeng Yin; Guiling Wang

arXiv:2412.16420·cs.CV·December 24, 2024

TL;DR

TextFlow introduces a modular, text-based approach to flowchart understanding, enhancing controllability, explainability, and performance over traditional end-to-end vision-language models.

Contribution

It proposes a two-stage framework that leverages textual representations for flowchart analysis, addressing controllability and explainability issues in existing VLM-based methods.

Findings

01. Achieves state-of-the-art results on FlowVQA and FlowLearn benchmarks.

02. Enhances error attribution to visual or textual components.

03. Improves robustness and user control in flowchart understanding.

Abstract

Flowcharts are typically presented as images, driving the trend of using vision-language models (VLMs) for end-to-end flowchart understanding. However, two key challenges arise: (i) Limited controllability--users have minimal influence over the downstream task, as they can only modify input images, while the training of VLMs is often out of reach for most researchers. (ii) Lack of explainability--it is difficult to trace VLM errors to specific causes, such as failures in visual encoding or reasoning. We propose TextFlow, which addresses these issues in two stages: (i) a Vision Textualizer, which generates textual representations from flowchart images; and (ii) a Textual Reasoner, which performs question answering over the text representations. TextFlow offers three key advantages: (i) users can select the type of text representations (e.g., Graphviz, Mermaid, PlantUML), or further…
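The two-stage split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_pipeline`, `PipelineResult`, `fake_textualizer`, and `count_edges_reasoner` are hypothetical, and both stages are stand-in functions rather than real model calls. The sketch only shows the shape of the idea: the intermediate text is an explicit value that can be inspected, swapped to another format, or edited before reasoning.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; neither comes from the paper or its repository.
Textualizer = Callable[[bytes, str], str]  # (image bytes, format name) -> flowchart text
Reasoner = Callable[[str, str], str]       # (flowchart text, question) -> answer


@dataclass
class PipelineResult:
    representation: str  # intermediate text, kept so an error can be traced to a stage
    answer: str


def run_pipeline(image: bytes, question: str,
                 textualize: Textualizer, reason: Reasoner,
                 fmt: str = "mermaid") -> PipelineResult:
    """Run stage 1 (image -> text), then stage 2 (text + question -> answer)."""
    representation = textualize(image, fmt)
    answer = reason(representation, question)
    return PipelineResult(representation, answer)


def fake_textualizer(image: bytes, fmt: str) -> str:
    """Stand-in for a vision model: ignores the image and returns fixed Mermaid text."""
    return (
        "flowchart TD\n"
        "    A[Start] --> B{Is x positive}\n"
        "    B -->|yes| C[Print x]\n"
        "    B -->|no| D[Stop]"
    )


def count_edges_reasoner(representation: str, question: str) -> str:
    """Stand-in for a language model: answers by counting Mermaid arrows."""
    return str(representation.count("-->"))


result = run_pipeline(b"", "How many edges are there?",
                      fake_textualizer, count_edges_reasoner)
print(result.representation)
print(result.answer)
```

Because `result.representation` is returned alongside the answer, a wrong answer can be checked against the intermediate text: if the text already misdescribes the flowchart, the fault lies in the first stage; otherwise it lies in the reasoning stage.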

Code & Models

Repositories

junyiye/textflow (Official)

Videos

Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques