Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search

Linhao Yu; Xinguang Ji; Yahui Liu; Fanheng Kong; Chenxi Sun; Jingyuan Zhang; Hongzhi Zhang; V. W.; Fuzheng Zhang; Deyi Xiong

arXiv:2506.11155 · cs.CV · June 16, 2025

TL;DR

This paper introduces AutoCaption, an MCTS-based framework for generating diverse video captions, enabling comprehensive evaluation of multimodal large language models on detailed video understanding tasks.

Contribution

We propose AutoCaption, an iterative captioning method using Monte Carlo Tree Search to create diverse, detailed video descriptions for improved model evaluation.

Findings

01. AutoCaption effectively generates diverse video descriptions.

02. The MCTS-VCB benchmark enables comprehensive evaluation of MLLMs.

03. Fine-tuning models on AutoCaption data significantly improves their performance.

Abstract

Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to iteratively construct numerous and diverse descriptive sentences (i.e., key points) that thoroughly represent video content. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects' attributes, and environment details. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of…
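The iterative, tree-based construction of key points described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it is a generic UCT-style Monte Carlo Tree Search in which a hard-coded candidate pool stands in for sentences an MLLM would propose, and a simple category-coverage score stands in for whatever reward AutoCaption actually uses. All names (`CANDIDATES`, `Node`, `reward`, `search`) and the example sentences are hypothetical.

```python
import math
import random

# Hypothetical candidate pool of (category, sentence) pairs. In a real system
# an MLLM would propose these from the video; here they are stand-ins.
CANDIDATES = [
    ("action", "A man slices an onion."),
    ("action", "He wipes the knife on a towel."),
    ("object", "The knife has a wooden handle."),
    ("object", "The cutting board is green."),
    ("environment", "The kitchen is lit by a window."),
    ("environment", "A pot steams in the background."),
]


class Node:
    def __init__(self, keypoints, parent=None):
        self.keypoints = keypoints  # tuple of candidate indices chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def untried(self):
        """Candidate indices not yet used on this path or expanded here."""
        tried = {child.keypoints[-1] for child in self.children}
        return [i for i in range(len(CANDIDATES))
                if i not in self.keypoints and i not in tried]


def reward(keypoints):
    """Toy diversity score (an assumption): fraction of categories covered."""
    covered = {CANDIDATES[i][0] for i in keypoints}
    all_categories = {category for category, _ in CANDIDATES}
    return len(covered) / len(all_categories)


def uct(child, parent_visits, c=1.4):
    """Standard UCT score: mean value plus an exploration bonus."""
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def search(max_depth=3, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded.
        while len(node.keypoints) < max_depth and not node.untried():
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # Expansion: add one new key point as a child node.
        if len(node.keypoints) < max_depth:
            idx = rng.choice(node.untried())
            child = Node(node.keypoints + (idx,), parent=node)
            node.children.append(child)
            node = child
        # Simulation: fill the remaining slots at random and score the set.
        rollout = list(node.keypoints)
        remaining = [i for i in range(len(CANDIDATES)) if i not in rollout]
        rng.shuffle(remaining)
        rollout += remaining[: max_depth - len(rollout)]
        score = reward(rollout)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    # Read out the most-visited path as the final set of key points.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return [CANDIDATES[i][1] for i in node.keypoints]


if __name__ == "__main__":
    for sentence in search():
        print(sentence)
```

Each tree level adds one key point, so a path from the root corresponds to a progressively more detailed description. The coverage reward is only a placeholder chosen to make the diversity pressure visible in a few lines.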

Code & Models

Datasets

HasuerYu/AutoCaption
dataset · 163 downloads

Videos

Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization