Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models

Luca M. Schulze Buschoff; Konstantinos Voudouris; Elif Akata; Matthias Bethge; Joshua B. Tenenbaum; Eric Schulz

arXiv:2502.15678 · cs.LG · June 2, 2025

TL;DR

This paper evaluates the limits of fine-tuning vision language models to enhance visual cognition and human alignment, revealing domain-specific improvements but limited generalization across different tasks and visual features.

Contribution

It introduces a systematic evaluation framework with visual stimuli and human judgments, and analyzes how fine-tuning on specific cognitive tasks affects model performance and generalization.

Findings

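1. Fine-tuning improves performance in the targeted cognitive domains (intuitive physics and causal reasoning).

2. Fine-tuning can also improve alignment with human behavior in those domains.

3. Task-specific fine-tuning does not generalize robustly to data with other visual characteristics or to tasks in other cognitive domains.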

Abstract

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
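
The abstract separates two quantities: performance against ground truth and alignment with human behavior. The sketch below illustrates how the two can diverge on the same items. It is not the paper's evaluation code: the function names, the toy labels, and the choice of majority-vote agreement as an alignment measure are all assumptions made for illustration.

```python
from collections import Counter


def accuracy(predictions, targets):
    """Fraction of items where the prediction equals the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("need at least one item")
    return sum(p == t for p, t in zip(predictions, targets)) / len(predictions)


def human_agreement(predictions, human_responses):
    """Fraction of items where the prediction equals the majority human answer.

    Majority vote is an assumed stand-in for an alignment measure.
    """
    majority = [Counter(r).most_common(1)[0][0] for r in human_responses]
    return accuracy(predictions, majority)


# Hypothetical toy data: four stability judgments on block towers.
model_answers = ["stable", "unstable", "stable", "unstable"]
ground_truth = ["stable", "unstable", "unstable", "unstable"]
human_judgments = [
    ["stable", "stable", "unstable"],
    ["unstable", "unstable", "unstable"],
    ["stable", "stable", "unstable"],  # most humans err on this item
    ["stable", "unstable", "unstable"],
]

print(accuracy(model_answers, ground_truth))
print(human_agreement(model_answers, human_judgments))
```

In this toy case the model misses one ground-truth label but matches the human majority on every item, which is why accuracy and human alignment are worth reporting separately.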
