STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Chen Li; Han Zhang; Zhantao Yang; Fangyi Chen; Zihan Wang; Anudeepsekhar Bolimera; Marios Savvides

arXiv:2508.08688 · cs.AI · February 11, 2026


TL;DR

STELAR-Vision introduces a topology-aware training framework for vision-language models, enhancing reasoning capabilities and output efficiency by incorporating diverse topological structures and reducing verbosity.

Contribution

The paper presents STELAR-Vision, a novel training framework that integrates topological reasoning structures and frugal output techniques into vision-language models.
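To make the idea of a reasoning topology concrete, the sketch below classifies a reasoning trace as a chain, a tree, or a graph based on how its steps depend on one another. It is a hypothetical illustration, not the paper's TopoAug pipeline. The edge-list representation (an edge `(u, v)` means step `v` builds directly on step `u`) and the function name `classify_topology` are assumptions introduced here.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify a reasoning trace as 'chain', 'tree', or 'graph'.

    Hypothetical representation: each edge (u, v) means that reasoning
    step v builds directly on reasoning step u.
    """
    parents = defaultdict(int)   # number of predecessors per step
    children = defaultdict(int)  # number of successors per step
    for u, v in edges:
        children[u] += 1
        parents[v] += 1

    # A step that merges two or more earlier steps makes the trace a graph.
    if any(count > 1 for count in parents.values()):
        return "graph"
    # No merges, but some step branches into several continuations: a tree.
    if any(count > 1 for count in children.values()):
        return "tree"
    # Every step has at most one predecessor and one successor: a chain (CoT).
    return "chain"
```

A plain chain-of-thought trace such as `[(0, 1), (1, 2)]` is classified as `"chain"`. Adding a second branch from step 0 yields `"tree"`, and letting two branches feed a common step yields `"graph"`.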

Findings

01. Improves accuracy by 9.7% over its base model on MATH-V and VLM-S2H.
02. Surpasses the larger Qwen2VL-72B-Instruct in accuracy by 7.3%.
03. Outperforms existing methods on multiple out-of-distribution (OOD) benchmarks.

Abstract

Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On…
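The abstract does not specify how Frugal Learning trades output length against accuracy. One generic way to express that trade-off during reinforcement learning is a length-aware reward. The sketch below is an assumption for illustration, not the paper's method; `frugal_reward`, `budget`, and `alpha` are hypothetical names, and their default values are arbitrary.

```python
def frugal_reward(is_correct: bool, num_tokens: int,
                  budget: int = 512, alpha: float = 0.5) -> float:
    """Hypothetical length-aware reward for RL post-training.

    Correctness contributes 1.0; tokens beyond `budget` are penalized in
    proportion to the overflow, with the penalty capped at `alpha`.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    accuracy_term = 1.0 if is_correct else 0.0
    # Fraction of the budget exceeded (0.0 when the answer fits the budget).
    overflow = max(0, num_tokens - budget) / budget
    # Cap the penalty so a long correct answer still beats a wrong one.
    return accuracy_term - alpha * min(1.0, overflow)
```

With the defaults, a correct 768-token answer scores 0.75, while an incorrect answer of any length scores at most 0.0. Correctness therefore still dominates the reward, and unnecessary verbosity is discouraged rather than forbidden.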


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks