Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

Diogo Freitas; Brigt Håvardstun; Cèsar Ferri; Darío Garigliotti; Jan Arne Telle; José Hernández-Orallo

arXiv:2505.10583 · cs.CV · August 29, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates whether the complexity of teaching vision-language models is consistent across different modalities, finding that concept simplicity appears invariant to whether data is represented as images or coordinate traces.

Contribution

The paper introduces a machine-teaching method for comparing teaching complexity across modalities and presents evidence that concept simplicity may be an inherent property of the concept rather than of its data representation.

Findings

01

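Image-based representations generally require fewer segments for teaching than coordinate-based (TikZ) representations.

02

Image-based representations achieve higher identification accuracy than coordinate-based representations.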

03

The relative complexity of concepts is similar across modalities, suggesting that concept simplicity is modality-invariant.
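One way to make "similar across modalities" concrete is to check whether concepts are ordered the same way by teaching complexity in both modalities. The sketch below computes a Spearman rank correlation in plain Python. It is an illustration only: this summary does not state which statistic the paper uses, and the function names and the toy numbers in the usage note are assumptions, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical placeholder values (NOT from the paper): segments needed
# per concept in each modality, listed in the same concept order.
image_segments = [3, 7, 2]
tikz_segments = [6, 12, 5]
rho = spearman(image_segments, tikz_segments)
```

In this toy input the coordinate modality needs more segments for every concept, yet the ordering of concepts is identical, so `rho` is close to 1. That is the pattern the findings describe: absolute difficulty differs by modality while relative difficulty does not.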

Abstract

Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But,…
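To illustrate the coordinate-based presentation, here is a minimal sketch that simplifies a stroke and writes it as a TikZ path. It assumes that "RDP" in the reviews refers to the Ramer-Douglas-Peucker line-simplification algorithm, and that a drawing is a list of strokes, each a list of `(x, y)` points. The input layout, the function names, and the exact TikZ serialization are assumptions for illustration; the paper's own format may differ.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment (e.g. a closed stroke): distance to the point a.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: drop points closer than epsilon to the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [first, last]
    # Keep the farthest point and simplify both halves recursively.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right


def stroke_to_tikz(points):
    """Serialize one stroke as a TikZ polyline."""
    coords = " -- ".join(f"({x},{y})" for x, y in points)
    return f"\\draw {coords};"


def drawing_to_tikz(strokes, epsilon):
    """Simplify every stroke, then return (TikZ source, segment count)."""
    simplified = [rdp(s, epsilon) for s in strokes]
    lines = [stroke_to_tikz(s) for s in simplified if len(s) >= 2]
    n_segments = sum(len(s) - 1 for s in simplified if len(s) >= 2)
    body = "\n".join(lines)
    return f"\\begin{{tikzpicture}}\n{body}\n\\end{{tikzpicture}}", n_segments


# Toy stroke (hypothetical, not taken from the dataset).
stroke = [(0, 0), (1, 0.1), (2, 0), (3, 5)]
tikz_source, n_segments = drawing_to_tikz([stroke], epsilon=0.5)
```

A larger `epsilon` removes more points, so sweeping it is one plausible way to obtain versions of the same drawing with different numbers of segments, which is the quantity the abstract reports.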

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The evaluation protocol is technically sound.
- The findings are interesting.

Weaknesses

- The testing model and dataset are limited: in the experiment, only GPT-4V is considered as the model for testing, and the test set is limited to 20 concepts from a specific dataset. Given the variety of multimodal LLMs, including both open-source and proprietary models, the reviewer suggests testing additional models, especially advanced open-source models, to further verify the findings and demonstrate the effectiveness of the proposed protocol.
- Potential applications and impact are unclea

Reviewer 02 · Rating 3 · Confidence 3

Strengths

The approach of machine teaching to explore modality-invariant concept complexity is interesting, particularly in comparing GPT-4V's handling of bitmap and coordinate-based representations.

Weaknesses

Most importantly, the concept understanding explored in this paper does not generalize to real-world settings, and the findings are not insightful beyond the comparison of concept understanding between the image representation and stroke coordinates.

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- The premise of the paper, or the idea of comparing how vision-language models process analogous image vs. text inputs, is quite interesting and novel.

Weaknesses

- I would like to see some discussion of data filtering / quality checking of the evaluation set, given that RDP is an automated algorithm.
- Is it possible to conduct an experiment without any RDP simplified images, e.g., for a given concept simply sampling drawings with different numbers of segments from the Quick, Draw! dataset?
- The paper mainly studies the identification accuracy of GPT4-V stratified by different factors (e.g., concept class, modality, level of complexity, relative ran

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Sparse Evolutionary Training