SVG Decomposition for Enhancing Large Multimodal Models Visualization Comprehension: A Study with Floor Plans
Jeongah Lee, Ali Sarvghad

TL;DR
This study investigates how SVG decomposition can improve large multimodal models' understanding of complex floor plans, revealing benefits for spatial tasks but also limitations in reasoning capabilities.
Contribution
The paper demonstrates both the potential and the challenges of using SVG-based decomposition to improve LMMs' comprehension of spatial visualizations.
Findings
SVG+PNG input improves spatial understanding tasks.
SVG decomposition can hinder complex spatial reasoning.
Limitations exist in LMMs' pathfinding abilities with SVGs.
Abstract
Large multimodal models (LMMs) are increasingly capable of interpreting visualizations, yet they continue to struggle with spatial reasoning. One proposed strategy is decomposition, which breaks down complex visualizations into structured components. In this work, we examine the efficacy of scalable vector graphics (SVGs) as a decomposition strategy for improving LMMs' performance on floor plan comprehension. Floor plans serve as a valuable testbed because they combine geometry, topology, and semantics, and their reliable comprehension has real-world applications, such as accessibility for blind and low-vision individuals. We conducted an exploratory study with three LMMs (GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct) across 75 floor plans. Results show that combining SVG with raster input (SVG+PNG) improves performance on spatial understanding tasks but often hinders…
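To make the idea of SVG decomposition concrete, the following is a minimal sketch of how a floor plan stored as SVG might be broken into structured components and serialized as text to accompany a raster image. It is not the authors' pipeline: the toy floor plan, the element ids, and the helper names `decompose` and `to_prompt_text` are all hypothetical, and only `rect` and `line` elements are handled.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical toy floor plan; the ids and layout are illustrative only.
TOY_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <rect id="kitchen" x="0" y="0" width="100" height="100"/>
  <rect id="bedroom" x="100" y="0" width="100" height="100"/>
  <line id="door-kitchen-bedroom" x1="100" y1="40" x2="100" y2="60"/>
</svg>"""


def decompose(svg_text):
    """Return one record per rect/line element, in document order."""
    root = ET.fromstring(svg_text)
    components = []
    for elem in root.iter():
        tag = elem.tag.replace(SVG_NS, "")  # strip the SVG namespace prefix
        if tag in ("rect", "line"):
            attrs = {k: v for k, v in elem.attrib.items() if k != "id"}
            components.append({"id": elem.get("id"), "shape": tag, "attrs": attrs})
    return components


def to_prompt_text(components):
    """Serialize components as plain text, one line per component."""
    lines = []
    for c in components:
        geometry = ", ".join(f"{k}={v}" for k, v in sorted(c["attrs"].items()))
        lines.append(f"{c['shape']} '{c['id']}': {geometry}")
    return "\n".join(lines)


components = decompose(TOY_SVG)
prompt_text = to_prompt_text(components)
print(prompt_text)
```

In an SVG+PNG setup of the kind the abstract describes, text like `prompt_text` would be sent to the model alongside the rendered image; how the study actually structured its SVG input is not specified in this summary.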
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Data Visualization and Analytics
