Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

Minkuk Kim; Hyeon Bae Kim; Jinyoung Moon; Jinwoo Choi; Seong Tae Kim

arXiv:2404.07610·cs.CV·April 12, 2024·2 cites

TL;DR

This paper introduces a novel dense video captioning framework that leverages external memory and cross-modal retrieval to improve event localization and captioning without extensive pretraining.

Contribution

It proposes a new model inspired by human cognitive processing, incorporating external memory and cross-modal retrieval for better dense video captioning.

Findings

01. Effective on ActivityNet Captions dataset
02. Outperforms existing methods without large-scale pretraining
03. Utilizes cross-modal memory retrieval for improved semantic understanding

Abstract

There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a multitasking problem of event localization and event captioning to consider inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by the cognitive information processing of humans. Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is proposed with cross-modal video-to-text matching. To effectively incorporate retrieved text features, the versatile encoder and the decoder with visual and textual cross-attention modules are designed. Comparative experiments have been…
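The retrieval step described in the abstract (matching video features against a text memory) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine`, `retrieve`, and `aggregate`, the use of cosine similarity, top-k selection, mean pooling, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the video and text features would come from encoders that share an embedding space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, memory, k=2):
    """Return indices of the k memory entries most similar to the query."""
    ranked = sorted(
        range(len(memory)),
        key=lambda i: cosine(query, memory[i]),
        reverse=True,
    )
    return ranked[:k]

def aggregate(memory, indices):
    """Mean-pool the selected memory entries into one text feature."""
    dim = len(memory[0])
    return [sum(memory[i][d] for i in indices) / len(indices) for d in range(dim)]

# Hypothetical text memory: one embedding per stored sentence.
text_memory = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical embedding of one video segment (the query).
video_segment = [1.0, 0.05, 0.0]

top = retrieve(video_segment, text_memory, k=2)
retrieved = aggregate(text_memory, top)
```

Here `retrieved` stands in for the text feature that a decoder with textual cross-attention would attend to alongside the visual features. Mean pooling is only one possible way to combine the retrieved entries.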

Code & Models

Repositories

ailab-kyunghee/cm2_dvc
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques