Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models
Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo

TL;DR
This paper investigates whether the complexity of teaching vision-language models is consistent across different modalities, finding that concept simplicity appears invariant to whether data is represented as images or coordinate traces.
Contribution
The paper introduces a method to compare teaching complexity across modalities and presents evidence that concept simplicity is inherent to the concept rather than dependent on the data representation.
Findings
Image-based representations generally require fewer segments for teaching than coordinate-based representations.
Image-based data also yields higher identification accuracy than coordinate-based data.
Concept complexity is similar across modalities, indicating modality invariance.
Abstract
Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But,…
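To make the two ingredients named above and in the reviews more concrete, the following is a minimal sketch of how a drawing stroke could be simplified to fewer segments with the Ramer-Douglas-Peucker (RDP) algorithm and then written out as a TikZ path. It is not the authors' pipeline: the function names (`rdp`, `stroke_to_tikz`, `count_segments`), the tolerance value, and the exact TikZ layout are assumptions made for illustration only.

```python
from math import hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _line_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the line through a and b (or to a if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = hypot(dx, dy)
    if norm == 0.0:
        return hypot(px - ax, py - ay)
    # Magnitude of the 2D cross product divided by the base length.
    return abs(dx * (ay - py) - dy * (ax - px)) / norm


def rdp(points: Sequence[Point], epsilon: float) -> List[Point]:
    """Ramer-Douglas-Peucker simplification of one stroke (polyline)."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _line_distance(points[i], first, last)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        # Every interior point is within tolerance: keep only the endpoints.
        return [first, last]
    # Otherwise split at the farthest point and simplify both halves.
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


def stroke_to_tikz(points: Sequence[Point]) -> str:
    """Render one stroke as a TikZ \\draw path (assumed layout)."""
    coords = " -- ".join(f"({x:g},{y:g})" for x, y in points)
    return rf"\draw {coords};"


def count_segments(strokes: Sequence[Sequence[Point]]) -> int:
    """Total number of straight segments across all strokes of a drawing."""
    return sum(max(len(s) - 1, 0) for s in strokes)


if __name__ == "__main__":
    # Hypothetical stroke; epsilon = 0.1 is an arbitrary illustrative tolerance.
    stroke = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
    simplified = rdp(stroke, 0.1)
    print(stroke_to_tikz(simplified))
    print(count_segments([simplified]))
```

Raising `epsilon` removes more points, so the same drawing can be presented at several levels of complexity, measured here as a segment count. A bitmap rendering of the simplified strokes and the TikZ text would then be two presentations of the same simplified drawing.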
Peer Reviews
Decision · Submitted to ICLR 2025
- The evaluation protocol is technically sound.
- The findings are interesting.
- The testing model and dataset are limited: in the experiment, only GPT-4V is considered as the model for testing, and the test set is limited to 20 concepts from a specific dataset. Given the variety of multimodal LLMs, including both open-source and proprietary models, the reviewer suggests testing additional models, especially advanced open-source models, to further verify the findings and demonstrate the effectiveness of the proposed protocol.
- Potential applications and impact are unclea
The approach of machine teaching to explore modality-invariant concept complexity is interesting, particularly in comparing GPT-4V's handling of bitmap and coordinate-based representations.
Most importantly, the practicality of the concept understanding explored in this paper cannot be generalized to real-world settings, and the findings are not insightful beyond the comparison of concept understanding between the image representation and stroke coordinates.
- The premise of the paper, or the idea of comparing how vision-language models process analogous image vs. text inputs, is quite interesting and novel.
- I would like to see some discussion of data filtering / quality checking of the evaluation set, given that RDP is an automated algorithm.
- Is it possible to conduct an experiment without any RDP simplified images, e.g., for a given concept simply sampling drawings with different numbers of segments from the Quick, Draw! dataset?
- The paper mainly studies the identification accuracy of GPT4-V stratified by different factors (e.g., concept class, modality, level of complexity, relative ran
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Sparse Evolutionary Training
