Exploring Primitive Visual Measurement Understanding and the Role of Output Format in Learning in Vision-Language Models
Ankit Yadav, Lingqiao Liu, Yuankai Qi

TL;DR
This paper evaluates vision-language models' ability to understand and measure primitive shapes, emphasizing the impact of output format and loss scaling on their spatial reasoning and out-of-domain generalization.
Contribution
It introduces a benchmark for primitive shape understanding and shows that sentence outputs and loss scaling improve model performance and generalization in spatial tasks.
Findings
Sentence outputs outperform tuple formats in out-of-domain scenarios.
Scaling the loss on numeric tokens enhances numerical approximation capabilities.
Output format and loss strategies significantly impact model generalization.
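To make the tuple-versus-sentence distinction concrete, here is a minimal sketch of how one shape annotation might be rendered in each output format. The `Shape` fields follow the attributes listed in the abstract, but the exact templates, field order, and names (`to_tuple_format`, `to_sentence_format`) are assumptions for illustration, not the paper's actual prompts or targets.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Shape:
    """Hypothetical annotation record; field names are illustrative only."""
    shape_type: str
    color: str
    quadrant: int
    center: Tuple[int, int]
    rotation_deg: int
    occluded: bool


def to_tuple_format(s: Shape) -> str:
    # Compact target: values only, with meaning carried by position.
    return (
        f"({s.shape_type}, {s.color}, {s.quadrant}, "
        f"{s.center[0]}, {s.center[1]}, {s.rotation_deg}, {s.occluded})"
    )


def to_sentence_format(s: Shape) -> str:
    # Verbose target: every value is embedded in natural-language context.
    occlusion = "is occluded" if s.occluded else "is not occluded"
    return (
        f"The {s.color} {s.shape_type} is in quadrant {s.quadrant}, "
        f"centered at ({s.center[0]}, {s.center[1]}), "
        f"rotated by {s.rotation_deg} degrees, and {occlusion}."
    )


example = Shape("square", "red", 1, (120, 80), 45, False)
print(to_tuple_format(example))
print(to_sentence_format(example))
```

Both strings carry the same information; only the surrounding linguistic context differs, which is the variable the findings above compare.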
Abstract
This work investigates the capabilities of current vision-language models (VLMs) in visual understanding and attribute measurement of primitive shapes. It uses a benchmark of controlled 2D shape configurations that vary in spatial positioning, occlusion, rotation, size, and shape attributes such as type, quadrant, center coordinates, rotation, occlusion status, and color, as shown in Figure 1 and supplementary Figures S3-S81. We fine-tune state-of-the-art VLMs (2B-8B parameters) using Low-Rank Adaptation (LoRA) and validate them on multiple out-of-domain (OD) scenarios from our proposed benchmark. Our findings reveal that coherent sentence-based outputs outperform tuple formats, particularly in OD scenarios with large domain gaps. Additionally, we demonstrate that scaling numeric tokens during loss computation enhances numerical approximation capabilities, further improving…
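The idea of scaling numeric tokens during loss computation can be sketched as a weighted token-level negative log-likelihood. This is an illustrative sketch only: the scale factor, the digit-based rule for detecting numeric tokens, and the normalization by total weight are all assumptions, and the paper's actual formulation may differ.

```python
import math
from typing import List


def is_numeric_token(token: str) -> bool:
    # Assumed rule: a token is numeric if it contains at least one digit and
    # otherwise only digits, decimal points, or minus signs.
    stripped = token.strip()
    if not stripped:
        return False
    has_digit = any(ch.isdigit() for ch in stripped)
    only_numeric_chars = all(ch.isdigit() or ch in ".-" for ch in stripped)
    return has_digit and only_numeric_chars


def scaled_token_loss(
    tokens: List[str],
    target_probs: List[float],
    numeric_scale: float = 2.0,  # hypothetical value, not from the paper
) -> float:
    """Weighted mean NLL where numeric target tokens count `numeric_scale` times."""
    if len(tokens) != len(target_probs):
        raise ValueError("tokens and target_probs must have the same length")
    weights = [numeric_scale if is_numeric_token(t) else 1.0 for t in tokens]
    nll = [-math.log(p) for p in target_probs]
    return sum(w * l for w, l in zip(weights, nll)) / sum(weights)


# The model is confident on the words but unsure about the number.
tokens = ["center", "x", "is", "120"]
probs = [0.9, 0.9, 0.9, 0.5]
print(scaled_token_loss(tokens, probs, numeric_scale=1.0))  # plain mean NLL
print(scaled_token_loss(tokens, probs, numeric_scale=3.0))  # numeric error weighted up
```

With `numeric_scale=1.0` the function reduces to the ordinary mean token loss, so the scale factor alone controls how much extra gradient signal mispredicted numbers would receive.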
Taxonomy
Topics: Technology-Enhanced Education Studies · Visual and Cognitive Learning Processes · Language, Metaphor, and Cognition
