TechING: Towards Real World Technical Image Understanding via VLMs
Tafazzul Nadeem, Bhavik Shangari, Manish Rai, Gagan Raj Gupta, Ashutosh Modi

TL;DR
This paper presents a method to improve vision-language models' (VLMs') understanding of technical diagrams by training on synthetic data and introducing new self-supervision tasks, leading to significant performance gains on real-world hand-drawn images.
Contribution
The authors create a large synthetic dataset and develop new self-supervision tasks to enhance VLMs' ability to understand technical diagrams, especially hand-drawn ones.
Findings
Significant improvement in ROUGE-L scores after fine-tuning on synthetic images.
Achieved minimal compilation errors across most diagram types in human evaluations.
Enhanced F1 score of Llama 3.2 11B-instruct by nearly 7 times on real-world images.
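For readers unfamiliar with the ROUGE-L metric named in the findings, the following is a minimal sketch of the standard longest-common-subsequence (LCS) formulation of ROUGE-L as an F1 score. It is not the authors' evaluation code: the whitespace tokenization, the function names `lcs_length` and `rouge_l_f1`, and the example strings are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming, keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens matched
    recall = lcs / len(ref_tokens)      # share of reference tokens matched
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical reference and model output, made up for this example.
    reference = "graph TD A --> B B --> C"
    candidate = "graph TD A --> B"
    print(round(rouge_l_f1(reference, candidate), 3))
```

In this made-up example the candidate reproduces the first five of the eight reference tokens in order, so precision is 1.0 and recall is 0.625. Because the score rewards in-order overlap, it is sensitive to dropped or reordered tokens in generated output.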
Abstract
Professionals working in technical domains typically hand-draw technical diagrams (e.g., flowcharts and block diagrams) on whiteboards, paper, etc. during discussions; however, if they want to edit these later, the diagrams need to be redrawn from scratch. Modern-day VLMs have made tremendous progress in image understanding, but they struggle when it comes to understanding technical diagrams. One way to overcome this problem is to fine-tune on real-world hand-drawn images, but it is not practically possible to generate a large number of such images. In this paper, we introduce a large synthetically generated corpus (reflective of real-world images) for training VLMs and subsequently evaluate VLMs on a smaller corpus of hand-drawn images (with the help of humans). We introduce several new self-supervision tasks for training and perform extensive experiments with various baseline models and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Data Visualization and Analytics
