Unlocking UML Class Diagram Understanding in Vision Language Models
Artem Naboichenko, René Peinl

TL;DR
This paper introduces a new benchmark and dataset for UML class diagram question answering, demonstrating that fine-tuning vision language models can significantly improve their understanding of such diagrams.
Contribution
It presents a novel benchmark and a large-scale dataset for UML diagram understanding, along with a fine-tuning approach that outperforms existing models.
Findings
Fine-tuning with LoRA improves UML diagram understanding.
A dataset of 16,000 image-question-answer triples was created.
Fine-tuned models outperform Qwen 3.5 27B on this task.
Abstract
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Progress has been made for bar charts, line charts, and similar diagrams, but there is still little research on other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based fine-tune easily outperforms Qwen 3.5 27B, a recent VLM that performs well on many other benchmarks.
