Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
Ashutosh Bajpai, Tamal Majumder, Akshay Nambi, Tanmoy Chakraborty

TL;DR
This paper introduces SPECTRA, a supervision-free reinforcement learning framework that enhances visual perception agents by grounding reasoning in observations and improving task accuracy and tool efficiency without human labels.
Contribution
The work presents a novel self-supervised approach for training vision-language agents, incorporating structured rollouts and a new metric for tool efficacy, reducing reliance on supervised data.
Findings
SPECTRA improves task accuracy by up to 5%.
Tool efficiency increases by 9%.
Agents learn effectively from environmental interaction alone.
Abstract
Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Cold-start Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling agents to self-discover robust behaviors without human preference labels. We further introduce…
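The multi-objective reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` type, the function names, the weights, and the binary structure check are all hypothetical, and the abstract does not say how the three terms are combined or how tool utility is measured. The sketch assumes a simple weighted sum and a caller-supplied tool-utility score in [0, 1]. The paper describes its structural constraint as soft; the check below uses a hard 0/1 simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    # Hypothetical rollout step: "tool" for a tool call and its observation,
    # "answer" for the final synthesis turn.
    kind: str
    content: str


def structure_reward(turns: List[Turn]) -> float:
    """Return 1.0 if one or more tool turns precede a single final answer, else 0.0.

    Assumption: a hard binary check standing in for the paper's soft constraint.
    """
    if not turns or turns[-1].kind != "answer":
        return 0.0
    evidence = turns[:-1]
    if not evidence or any(t.kind != "tool" for t in evidence):
        return 0.0
    return 1.0


def combined_reward(
    correct: bool,
    turns: List[Turn],
    tool_utility: float,
    weights: Tuple[float, float, float] = (1.0, 0.5, 0.5),  # assumed, not from the paper
) -> float:
    """Weighted sum of task correctness, rollout structure, and tool utility."""
    w_task, w_struct, w_tool = weights
    return (
        w_task * float(correct)
        + w_struct * structure_reward(turns)
        + w_tool * tool_utility
    )


rollout = [Turn("tool", "crop region"), Turn("answer", "cat")]
reward = combined_reward(correct=True, turns=rollout, tool_utility=0.5)
```

In this sketch, a rollout that answers without first gathering tool evidence loses the structure term entirely, which is one simple way to reward evidence-before-synthesis ordering without any human labels.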
