TL;DR
This paper introduces Top-Down Semantic Refinement (TDSR), a hierarchical planning framework for image captioning that improves global coherence and detail capture by modeling caption generation as a Markov Decision Process with efficient Monte Carlo Tree Search.
Contribution
It proposes a novel hierarchical refinement framework and an efficient MCTS algorithm tailored for vision-language models, significantly reducing computational costs while enhancing caption quality.
Findings
TDSR improves caption detail and coherence.
Achieves state-of-the-art results on multiple benchmarks.
Reduces VLM call frequency by an order of magnitude.
Abstract
Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for…
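To make the planning formulation concrete, the following is a minimal sketch of generic UCT-style Monte Carlo Tree Search over a toy caption-refinement MDP. It is not the paper's efficient MCTS variant, whose details are not given in this summary. Everything here is an assumption made for illustration: the `DETAILS` phrases, the `MAX_STEPS` horizon, and the `reward` function, which stands in for whatever VLM-based scoring the authors use.

```python
import math
import random

# Hypothetical toy setup: a state is the tuple of detail phrases added so far.
DETAILS = ("a dog", "on a beach", "at sunset", "chasing a ball")
MAX_STEPS = 3  # assumed planning horizon, not taken from the paper


def actions(state):
    """Return unused detail phrases, or nothing once the horizon is reached."""
    if len(state) >= MAX_STEPS:
        return []
    return [d for d in DETAILS if d not in state]


def reward(state):
    """Toy score in [0, 1]; a stand-in for a VLM-based caption reward."""
    in_order = sum(
        1 for a, b in zip(state, state[1:])
        if DETAILS.index(a) < DETAILS.index(b)
    )
    return (len(state) + in_order) / (2 * MAX_STEPS - 1)


class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}            # action -> child Node
        self.visits = 0
        self.value = 0.0              # sum of rollout rewards
        self.untried = actions(state)


def uct_select(node, c=1.4):
    """Pick the child maximizing mean value plus an exploration bonus."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(state, rng):
    """Complete the caption with random actions and score the result."""
    while actions(state):
        state = state + (rng.choice(actions(state)),)
    return reward(state)


def mcts(root_state, iterations, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_select(node)
        # Expansion: add one unexplored child, if any remain.
        if node.untried:
            action = node.untried.pop()
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # Simulation, then backpropagation up to the root.
        score = rollout(node.state, rng)
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    return root


def best_caption(root):
    """Follow the most-visited child at each level."""
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return node.state


if __name__ == "__main__":
    tree = mcts((), iterations=500)
    print(", ".join(best_caption(tree)))
```

In this sketch every rollout calls the cheap toy `reward`. In a real system each such call would presumably be a VLM query, which is the cost the abstract identifies as the main computational hurdle.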