VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning

Ji Soo Lee; Jongha Kim; Jeehye Na; Jinyoung Park; Hyunwoo J. Kim

arXiv:2501.06761·cs.CV·January 14, 2025

TL;DR

VidChain is a framework that decomposes dense video captioning (DVC) into a sequence of sub-tasks and aligns training with the evaluation metrics, improving fine-grained temporal understanding in Video Large Language Models (VideoLLMs).

Contribution

The paper proposes VidChain, combining Chain-of-Tasks and Metric-based Direct Preference Optimization to improve dense video captioning and temporal grounding.

Findings

01 Outperforms previous VideoLLMs on DVC benchmarks

02 Enhances fine-grained video understanding

03 Improves temporal video grounding results

Abstract

Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of…
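The abstract is cut off before it describes how M-DPO works, so the following is only an illustrative sketch of the general idea of building preference pairs from an evaluation metric and scoring them with the standard DPO objective. It is not the authors' implementation. The choice of temporal IoU as the metric, the `min_gap` threshold, and every function name here (`temporal_iou`, `build_preference_pair`, `dpo_loss`) are assumptions made for illustration.

```python
import math
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection-over-union of two time segments."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def build_preference_pair(
    candidates: List[Segment], gold: Segment, min_gap: float = 0.1
) -> Optional[Tuple[Segment, Segment]]:
    """Return (chosen, rejected) ranked by the metric, or None if no clear winner.

    Assumption: the best- and worst-scoring samples form the pair, and pairs
    whose metric gap is below `min_gap` are discarded as uninformative.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: temporal_iou(c, gold))
    rejected, chosen = ranked[0], ranked[-1]
    if temporal_iou(chosen, gold) - temporal_iou(rejected, gold) < min_gap:
        return None
    return chosen, rejected


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much the policy prefers the chosen response over
    the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable softplus(-x) = log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    gold = (5.0, 15.0)
    samples = [(0.0, 10.0), (4.0, 14.0), (20.0, 30.0)]
    print(build_preference_pair(samples, gold))
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the metric only decides which sampled output counts as preferred; the loss itself is the ordinary DPO loss. Whether VidChain selects pairs this way, and which metrics it uses, should be checked against the full paper and the official repository.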

Code & Models

Repositories

mlvlab/vidchain
PyTorch · Official

Datasets

simplecloud/VidChain-exercise
dataset · 1.0k downloads

Videos

VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning · underline