OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He

TL;DR
OVOD-Agent introduces a proactive, self-evolving detection framework using a Markov-Bandit approach to improve open-vocabulary object detection, especially on rare categories.
Contribution
It proposes a Markov-Bandit framework that combines Visual-CoT with self-supervised learning to support proactive visual reasoning in OVOD.
Findings
Consistent improvements on COCO and LVIS datasets.
Significant gains on rare categories.
Effective integration of Markov transition matrices with Bandit exploration.
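As a rough illustration of how a Markov transition matrix and bandit exploration can be combined in general, the toy sketch below tracks how often each action follows a given state and uses a standard UCB1 bonus to choose the next action. It is not the paper's formulation: the class name MarkovBandit, the parameter c, the add-one smoothing, and the use of the previous action as the state are all assumptions made for this example.

```python
import math


class MarkovBandit:
    """Toy selector: count-based transition estimates plus a UCB1 bonus.

    Hypothetical illustration only; not the OVOD-Agent implementation.
    """

    def __init__(self, n_actions, c=1.0):
        self.n = n_actions
        self.c = c  # exploration weight (assumed parameter)
        # counts[s][a]: number of times action a was taken from state s
        self.counts = [[0] * n_actions for _ in range(n_actions)]
        # reward_sum[s][a]: total reward observed for the pair (s, a)
        self.reward_sum = [[0.0] * n_actions for _ in range(n_actions)]

    def transition_matrix(self):
        """Row-normalised counts with add-one smoothing; each row sums to 1."""
        rows = []
        for row in self.counts:
            total = sum(row) + self.n
            rows.append([(k + 1) / total for k in row])
        return rows

    def select(self, state):
        """Return the action with the highest mean reward plus UCB1 bonus."""
        total = sum(self.counts[state])
        best, best_score = 0, -math.inf
        for a in range(self.n):
            k = self.counts[state][a]
            if k == 0:
                return a  # try every action once before comparing scores
            mean = self.reward_sum[state][a] / k
            bonus = self.c * math.sqrt(2 * math.log(total) / k)
            if mean + bonus > best_score:
                best, best_score = a, mean + bonus
        return best

    def update(self, state, action, reward):
        """Record one observed (state, action, reward) step."""
        self.counts[state][action] += 1
        self.reward_sum[state][action] += reward
```

In this sketch the same visit counts serve both purposes: they define the empirical transition matrix and they shrink the exploration bonus for actions that have already been tried often from a given state.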
Abstract
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context…
