TL;DR
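This paper introduces a diagnostic framework built on the Linear Separability Ceiling (LSC) to analyze Visual-Language Models (VLMs). It reveals an alignment gap, where generative accuracy falls short of what a linear classifier recovers from the same visual embeddings, and proposes a contrastive training method to improve abstract reasoning.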
Contribution
The paper presents the LSC framework for diagnosing VLM limitations and proposes a contrastive objective to enhance visual alignment and reasoning.
Findings
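Most models' generative accuracy fails to exceed the linear separability ceiling of their own visual embeddings.
The few models that surpass the ceiling do so either by refining visual representations into a more linearly separable form or by executing non-linear decision logic.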
Contrastive training improves models' ability on abstract reasoning tasks.
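The summary does not spell out the paper's contrastive objective, so the following is only a generic InfoNCE-style loss in NumPy, included to illustrate what "contrastive" means here. The function name `info_nce`, the `temperature` value, and the toy vectors are all assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's objective).

    anchor, positive: 1-D embeddings; negatives: 2-D array, one embedding per row.
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    # Cosine similarities: the positive pair goes first, followed by all negatives.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)

# Toy example: the loss is small when the anchor matches its positive
# and large when the anchor is closer to a negative.
anchor = np.array([1.0, 0.0])
loss_aligned = info_nce(anchor, np.array([1.0, 0.0]),
                        np.array([[-1.0, 0.0], [0.0, 1.0]]))
loss_misaligned = info_nce(anchor, np.array([-1.0, 0.0]),
                           np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Minimizing a loss of this shape pulls same-concept embeddings together and pushes opposite-concept embeddings apart, which is one plausible way to raise final-layer separability.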
Abstract
A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap", where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual…
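As a rough illustration of the LSC idea, the sketch below scores a simple probe on embeddings and compares it with a generative accuracy. It assumes a prototype probe (cosine similarity to class-mean embeddings), which is inferred from a reviewer's description rather than confirmed by the abstract. The names `prototype_probe_accuracy` and `alignment_gap`, the synthetic embeddings, and the generative accuracy of 0.55 are all hypothetical.

```python
import numpy as np

def prototype_probe_accuracy(pos_support, neg_support, queries, labels):
    """Accuracy of a cosine-similarity probe against class-mean embeddings."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    pos_proto = unit(pos_support.mean(axis=0))
    neg_proto = unit(neg_support.mean(axis=0))
    # Comparing the two cosine similarities reduces to the sign of one dot
    # product, so the decision is linear in the unit-normalized query.
    scores = unit(queries) @ (pos_proto - neg_proto)
    preds = (scores > 0).astype(int)
    return float((preds == labels).mean())

def alignment_gap(generative_accuracy, lsc_accuracy):
    """Negative values mean the model generates worse than its embeddings allow."""
    return generative_accuracy - lsc_accuracy

# Synthetic stand-in for visual embeddings: two well-separated clusters.
rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0
pos_support = rng.normal(0.0, 0.1, (6, dim)) + direction
neg_support = rng.normal(0.0, 0.1, (6, dim)) - direction
queries = np.vstack([
    rng.normal(0.0, 0.1, (10, dim)) + direction,
    rng.normal(0.0, 0.1, (10, dim)) - direction,
])
labels = np.array([1] * 10 + [0] * 10)

lsc = prototype_probe_accuracy(pos_support, neg_support, queries, labels)
gap = alignment_gap(0.55, lsc)  # 0.55 is a made-up generative accuracy
```

A negative `gap` corresponds to the situation the abstract calls the alignment gap: the embeddings already separate the classes, but the generated answers do not reflect it.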
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is well motivated and identifies an important problem: diagnosing perceptual–reasoning misalignment in VLMs.
2. The proposed LSC is a clear, interpretable metric that operationalizes representational quality in a meaningful way.
3. The paper is well written and organized, with experiments on 2 datasets and 8 models under various prompting strategies.
4. The paper introduces a contrastive fine-tuning objective that simultaneously improves generative accuracy and final-layer separability.
1. The paper defines “non-linearity” as “cases where linear probes fail.” The non-linearity can be an artifact of your measurement, which uses cosine similarity of Euclidean-averaged embeddings, rather than a measured representational property. The claim would be stronger with direct evidence of curvature or manifold structure.
2. Important evaluation details are underspecified, for instance how generative accuracy is computed relative to the probe-based classification accuracy.
3. Some of the results (e.g., S
1) The paper introduces the Linear Separability Ceiling framework to disentangle perception and reasoning in VLMs.
2) Through large-scale analysis, the authors reveal a pervasive alignment gap: most VLMs fail to outperform their own LSC, highlighting a fundamental but previously unmeasured bottleneck in vision–language reasoning.
3) The approach attains or surpasses human-level accuracy on OpenWorld and narrows the gap on HOI reasoning, demonstrating that the limitation in current VLMs stems fro
1) While the Linear Separability Ceiling is intuitively defined, the paper lacks a rigorous theoretical justification for why linear separability should represent the upper bound of perceptual quality. A more formal link between LSC and model capacity or information-theoretic limits is missing.
2) The claim that failures arise from “alignment gaps” rather than perception deficits is mostly correlational. The experiments show association but not causal evidence that reasoning misalignment causes
1. The graphs and tables are clear and easy to understand.
2. Experiments are thorough, covering multiple VLMs, datasets, PEFT methods, objectives, and generalization scenarios.
1. While effective, the LSC relies solely on linear separability. It's possible that representations hold complex non-linear structures useful for reasoning that the LSC metric fails to capture.
2. The core observation that VLM generative performance often fails to surpass a linear probe on its visual features is not entirely new. Similar gaps between representation quality and end-to-end performance have been previously studied, showing VLMs can underperform linear probes on classification or g
