OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

Chujie Wang; Jianyu Lu; Zhiyuan Luo; Xi Chen; Chu He

arXiv:2511.21064 · cs.AI · April 21, 2026

TL;DR

OVOD-Agent introduces a proactive, self-evolving detection framework using a Markov-Bandit approach to improve open-vocabulary object detection, especially on rare categories.

Contribution

The paper proposes a Markov-Bandit framework that combines Visual-CoT with self-supervised learning to support proactive visual reasoning in OVOD.

Findings

01. Consistent improvements on COCO and LVIS datasets.

02. Significant gains on rare categories.

03. Effective integration of Markov transition matrices with Bandit exploration.
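
This summary does not spell out how the paper combines Markov transition matrices with Bandit exploration, so the sketch below is only a generic illustration of the idea, not the authors' method. It assumes a small discrete set of context states and actions, uses the standard UCB1 rule for exploration, and keeps empirical transition counts per (state, action) pair. The class name `MarkovBandit`, its methods, and the exploration weight `c` are all hypothetical.

```python
import math


class MarkovBandit:
    """Hypothetical sketch: UCB1 action selection per state, plus
    empirical state-transition counts for each (state, action) pair."""

    def __init__(self, n_states, n_actions, c=1.0):
        self.n_states = n_states
        self.n_actions = n_actions
        self.c = c  # exploration weight (assumed hyperparameter)
        # counts[s][a]: how often action a was taken in state s
        self.counts = [[0] * n_actions for _ in range(n_states)]
        # values[s][a]: running mean reward of action a in state s
        self.values = [[0.0] * n_actions for _ in range(n_states)]
        # trans[s][a][s2]: observed transitions s -> s2 under action a
        self.trans = [
            [[0] * n_states for _ in range(n_actions)]
            for _ in range(n_states)
        ]

    def select(self, s):
        """Pick an action in state s with the UCB1 rule."""
        # Try every action once before using the confidence bound.
        for a in range(self.n_actions):
            if self.counts[s][a] == 0:
                return a
        total = sum(self.counts[s])
        scores = [
            self.values[s][a]
            + self.c * math.sqrt(math.log(total) / self.counts[s][a])
            for a in range(self.n_actions)
        ]
        return max(range(self.n_actions), key=scores.__getitem__)

    def update(self, s, a, reward, s_next):
        """Record one observed step (s, a, reward, s_next)."""
        self.counts[s][a] += 1
        n = self.counts[s][a]
        # Incremental update of the mean reward.
        self.values[s][a] += (reward - self.values[s][a]) / n
        self.trans[s][a][s_next] += 1

    def transition_probs(self, s, a):
        """Empirical row of the transition matrix for (s, a).
        Falls back to a uniform row when nothing has been observed."""
        row = self.trans[s][a]
        total = sum(row)
        if total == 0:
            return [1.0 / self.n_states] * self.n_states
        return [x / total for x in row]


if __name__ == "__main__":
    mb = MarkovBandit(n_states=2, n_actions=2)
    a = mb.select(0)
    mb.update(0, a, reward=1.0, s_next=1)
    print(mb.transition_probs(0, a))
```

In this sketch the bandit term drives which action is tried next, while the transition counts accumulate a row-normalised estimate of how actions move the system between states; how the paper actually couples the two is not stated here.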

Abstract

Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context…
