Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

Amartya Bhattacharya

arXiv:2603.27349 · cs.CV · March 31, 2026

TL;DR

This paper introduces a framework for evaluating and enhancing vision-language models' ability to perform compositional reasoning, using structural priors and scene graph parsing to improve accuracy on the Winoground benchmark.

Contribution

It presents a unified evaluation and augmentation framework that incorporates structural relational priors via scene graph parsing to improve compositional reasoning in VLMs.

Findings

01

Qwen3-VL-8B-Thinking achieves a Winoground group score of 62.75, far above all encoder-based models evaluated.

02

Multi-turn scene graph filtering raises Qwen3-VL-8B-Thinking's group score to 66.0, surpassing prior state-of-the-art.

03

Scene graph augmentation benefits capable models but not weaker baselines.
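
For readers unfamiliar with the metric, the sketch below shows how a Winoground-style group score is usually computed from the four caption-image match scores of each example. It follows the standard Winoground definition rather than this paper's code; the `PairScores` container and function names are hypothetical, and the scores in the usage lines are made-up numbers for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PairScores:
    """Model match scores for one example (two captions, two images).

    Field names are hypothetical: c0_i1 is the score of caption 0 with image 1.
    """
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def text_correct(s: PairScores) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def image_correct(s: PairScores) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s.c0_i0 > s.c0_i1 and s.c1_i1 > s.c1_i0


def group_correct(s: PairScores) -> bool:
    # An example counts toward the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def group_score(examples: list[PairScores]) -> float:
    # Percentage of examples that pass both the text and the image check.
    return 100.0 * sum(group_correct(s) for s in examples) / len(examples)


# Illustrative scores only (not taken from the paper).
solved = PairScores(c0_i0=0.9, c0_i1=0.1, c1_i0=0.2, c1_i1=0.8)
image_only = PairScores(c0_i0=0.9, c0_i1=0.1, c1_i0=0.95, c1_i1=0.96)
print(group_score([solved, image_only]))
```

Because all four comparisons must hold at once, the group score is never higher than the text or image score alone. Assuming the standard 400-example Winoground set, a group score of 62.75 would correspond to 251 fully solved examples and 66.0 to 264.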

Abstract

Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, that is, distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) that extracts subject-relation-object triples, and a Graph Asymmetry Scorer that uses optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing prior…
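
To make the idea of matching subject-relation-object triples concrete, here is a minimal sketch of optimal one-to-one matching between two small triple sets. It is not the paper's Graph Asymmetry Scorer, whose details are not given in this listing: the slot-overlap similarity, the normalisation, and every function name are assumptions chosen for illustration, and the optimal matching is found by exhaustive search, which is only practical for a handful of triples.

```python
from itertools import permutations

Triple = tuple[str, str, str]  # (subject, relation, object)


def slot_similarity(a: Triple, b: Triple) -> float:
    # Assumed similarity: fraction of the three slots that match exactly.
    return sum(x == y for x, y in zip(a, b)) / 3.0


def best_matching(left: list[Triple], right: list[Triple]):
    """Maximum-weight one-to-one matching, found by exhaustive search."""
    if len(left) > len(right):
        # Always enumerate assignments from the smaller side.
        score, pairs = best_matching(right, left)
        return score, [(j, i) for i, j in pairs]
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(right)), len(left)):
        pairs = list(enumerate(perm))
        score = sum(slot_similarity(left[i], right[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


def graph_similarity(left: list[Triple], right: list[Triple]) -> float:
    # Normalise by the larger graph so unmatched triples lower the score.
    if not left and not right:
        return 1.0
    score, _ = best_matching(left, right)
    return score / max(len(left), len(right))


# Two captions with the same words but swapped roles.
g0 = [("dog", "chases", "cat")]
g1 = [("cat", "chases", "dog")]
same_words = sorted(w for t in g0 for w in t) == sorted(w for t in g1 for w in t)
print(same_words, graph_similarity(g0, g0), graph_similarity(g0, g1))
```

The two captions are identical as bags of words, yet only the relation slot matches once roles are taken into account, which is the kind of structural difference a triple-level comparison can expose. For larger graphs, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the exhaustive search.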

Code & Models

Repositories

amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding (GitHub)
