O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty

TL;DR
O3SLM is a large vision-language model trained on a new, large-scale dataset of sketches, images, and instructions, significantly improving sketch understanding and reasoning across several visual tasks.
Contribution
The paper introduces a novel large-scale dataset of sketch-image-instruction triplets and a new LVLM, O3SLM, trained on this dataset for enhanced sketch comprehension.
Findings
Achieves state-of-the-art results on sketch-based tasks
Outperforms existing LVLMs in sketch reasoning
Demonstrates strong generalization across multiple sketch datasets
Abstract
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
