Top-Down Semantic Refinement for Image Captioning

Jusheng Zhang; Kaitong Cai; Jing Yang; Jian Wang; Chengpei Tang; Keze Wang

arXiv:2510.22391·cs.CV·February 17, 2026

Top-Down Semantic Refinement for Image Captioning

Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Chengpei Tang, Keze Wang

PDF

1 Video

TL;DR

This paper introduces Top-Down Semantic Refinement (TDSR), a hierarchical planning framework for image captioning that improves global coherence and detail capture by modeling caption generation as a Markov Decision Process with efficient Monte Carlo Tree Search.

Contribution

It proposes a novel hierarchical refinement framework and an efficient MCTS algorithm tailored for vision-language models, significantly reducing computational costs while enhancing caption quality.

Findings

01

TDSR improves caption detail and coherence.

02

State-of-the-art results on multiple benchmarks.

03

Reduces VLM call frequency by an order of magnitude.

Abstract

Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

Top-Down Semantic Refinement for Image Captioning· underline