TL;DR
OwlCap is a novel video captioning model that balances motion and detail by leveraging a new dataset, HMD-270K, and a specialized reward, CSER, to improve caption completeness and accuracy.
Contribution
The paper introduces the HMD-270K dataset and the Caption Set Equivalence Reward (CSER), which together are used to train OwlCap, a multi-modal large language model that balances motion and detail in video captioning.
Findings
OwlCap outperforms baselines on detail-focused and motion-focused benchmarks.
The HMD-270K dataset improves training for motion-detail balanced captioning.
CSER improves caption completeness and accuracy.
Abstract
Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through…
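For context on the optimization side, GRPO scores each sampled output relative to the other outputs sampled for the same input. The sketch below illustrates only that group-relative normalization. It is not the paper's CSER, whose definition is not given in this summary. The function name, the reward values, and the choice of population standard deviation with a small epsilon are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per caption sampled for the same video.
    The population std and the eps term are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four captions sampled for one video.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Captions that score above the group mean get a positive advantage and those below get a negative one, so the policy update depends on how a caption compares with its siblings, not on the absolute reward scale.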