Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions
Shue Shiinoki, Ryo Koshihara, Hayato Motegi, Masumi Morishige

TL;DR
This paper presents a text-driven approach that reads diagram data directly from editable source files such as xlsx, pptx, and docx, so that large language models can interpret diagrams in business documentation more accurately than vision-language models.
Contribution
It introduces a method that bypasses visual recognition by extracting textual metadata from source files to enhance diagram understanding with LLMs.
Findings
Text-driven approach outperforms VLM-based methods in accuracy
Method applicable to various document formats like pptx and docx
Enables more robust diagram analysis in business contexts
Abstract
Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting the structures and relationships depicted in diagrams continues to pose significant challenges. This study addresses these challenges by proposing a text-driven approach that bypasses reliance on VLMs' visual recognition capabilities. Instead, it utilizes the editable source files (such as xlsx, pptx, or docx) where diagram elements (e.g., shapes, lines, annotations) are preserved as textual metadata. In our proof-of-concept, we extracted diagram information from xlsx-based system design documents and transformed the extracted shape data into textual input for Large Language Models (LLMs). This approach allowed the LLM to analyze…
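The extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes that an xlsx file is a ZIP archive whose drawing parts under xl/drawings/ store shapes and connectors as DrawingML XML. The function names describe_drawing and describe_xlsx, the output line format, and the SAMPLE fragment are hypothetical choices made for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard Office Open XML namespaces for spreadsheet drawings.
NS = {
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def describe_drawing(xml_data):
    """Turn one drawing part (str or bytes) into plain-text lines for an LLM prompt."""
    root = ET.fromstring(xml_data)
    labels = {}  # shape id -> visible text, or the shape name if it has no text
    lines = []
    for sp in root.iter(f"{{{NS['xdr']}}}sp"):
        props = sp.find("xdr:nvSpPr/xdr:cNvPr", NS)
        if props is None:
            continue
        text = "".join(t.text or "" for t in sp.iterfind(".//a:t", NS)).strip()
        label = text or props.get("name", "")
        labels[props.get("id")] = label
        lines.append(f"shape {props.get('id')}: {label}")
    for cxn in root.iter(f"{{{NS['xdr']}}}cxnSp"):
        start = cxn.find("xdr:nvCxnSpPr/xdr:cNvCxnSpPr/a:stCxn", NS)
        end = cxn.find("xdr:nvCxnSpPr/xdr:cNvCxnSpPr/a:endCxn", NS)
        if start is None or end is None:
            continue  # connector is not attached to a shape at both ends
        src = labels.get(start.get("id"), start.get("id"))
        dst = labels.get(end.get("id"), end.get("id"))
        lines.append(f"connector: {src} -> {dst}")
    return "\n".join(lines)


def describe_xlsx(path):
    """Describe every drawing part found in an xlsx file (hypothetical helper)."""
    with zipfile.ZipFile(path) as zf:
        parts = sorted(
            n for n in zf.namelist()
            if n.startswith("xl/drawings/") and n.endswith(".xml")
        )
        return {n: describe_drawing(zf.read(n)) for n in parts}


# Hand-written fragment standing in for a real drawing part.
SAMPLE = """<xdr:wsDr
    xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <xdr:twoCellAnchor>
    <xdr:sp>
      <xdr:nvSpPr><xdr:cNvPr id="2" name="Rectangle 1"/></xdr:nvSpPr>
      <xdr:txBody><a:p><a:r><a:t>Client</a:t></a:r></a:p></xdr:txBody>
    </xdr:sp>
  </xdr:twoCellAnchor>
  <xdr:twoCellAnchor>
    <xdr:sp>
      <xdr:nvSpPr><xdr:cNvPr id="3" name="Rectangle 2"/></xdr:nvSpPr>
      <xdr:txBody><a:p><a:r><a:t>Server</a:t></a:r></a:p></xdr:txBody>
    </xdr:sp>
  </xdr:twoCellAnchor>
  <xdr:twoCellAnchor>
    <xdr:cxnSp>
      <xdr:nvCxnSpPr>
        <xdr:cNvPr id="4" name="Connector 3"/>
        <xdr:cNvCxnSpPr><a:stCxn id="2" idx="3"/><a:endCxn id="3" idx="1"/></xdr:cNvCxnSpPr>
      </xdr:nvCxnSpPr>
    </xdr:cxnSp>
  </xdr:twoCellAnchor>
</xdr:wsDr>"""

if __name__ == "__main__":
    print(describe_drawing(SAMPLE))
```

The resulting lines could be pasted into an LLM prompt in place of a rendered image. A fuller pipeline would likely also carry shape positions, grouping, and arrow direction, which this sketch leaves out.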
Taxonomy
Topics: Semantic Web and Ontologies · Model-Driven Software Engineering Techniques · AI-based Problem Solving and Planning
