OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Chunlin Zhong; Qiuxia Hou; Zhangjun Zhou; Shuang Hao; Haonan Lu; Yanhao Zhang; He Tang; Xiang Bai

arXiv:2508.18634 · cs.CV · August 28, 2025

TL;DR

OwlCap is a video captioning model that balances motion and detail. It is trained on a new dataset, HMD-270K, and optimized with a dedicated reward, the Caption Set Equivalence Reward (CSER), to improve caption completeness and accuracy.

Contribution

The paper introduces the HMD-270K dataset and the CSER reward, and uses them to train OwlCap, a multimodal large language model that balances motion and detail in video captioning.

Findings

01 OwlCap outperforms baselines on both detail-focused and motion-focused benchmarks.

02 The HMD-270K dataset enhances training for motion-detail balanced captioning.

03 CSER improves caption completeness and accuracy.

Abstract

Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through…
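The abstract states that CSER is built on Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative step only, the sketch below normalizes the rewards of several captions sampled for the same video against their group mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are assumptions made for illustration; the abstract as shown here does not give CSER's actual definition, so the rewards are treated as opaque numbers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-caption rewards within one sampled group.

    Each caption's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumed
    constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for three captions sampled for one video.
# In the paper these would come from CSER; here they are placeholders.
example_rewards = [0.2, 0.5, 0.8]
example_advantages = group_relative_advantages(example_rewards)
```

Captions scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update favors the relatively better captions within each group without needing a separate value model.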
