Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

Katsuyuki Nakamura; Hiroki Ohashi; Mitsuhiro Okada

arXiv:2109.02955 · cs.CV · September 8, 2021

TL;DR

This paper introduces a new sensor-augmented egocentric-video captioning task, a dedicated dataset, and an attention-based method that effectively combines video and sensor data to improve detailed activity descriptions.

Contribution

It proposes a novel task, dataset, and multi-modal attention method for egocentric video captioning utilizing wearable sensor data.

Findings

01. Sensor data improves captioning accuracy.

02. The proposed method outperforms strong baselines.

03. Multi-modal attention effectively fuses video and sensor data.

Abstract

Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method for the newly proposed task that effectively utilizes multi-modal data of video and motion sensors, or inertial measurement units (IMUs). While conventional video captioning tasks have difficulty in dealing with detailed descriptions of human activities due to the limited view of a fixed camera, egocentric vision has greater potential to be used for generating the finer-grained descriptions of human activities on the basis of a much closer view. In addition, we utilize wearable-sensor data as auxiliary information to mitigate the inherent problems in egocentric vision: motion blur, self-occlusion, and out-of-camera-range activities. We…
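The abstract is cut off before it describes the method, so the exact formulation is not available on this page. As a rough illustration of what attention over modalities generally looks like, the sketch below scores a video feature and a sensor (IMU) feature against the current decoder state and fuses them with softmax weights. This is a generic additive-attention sketch, not the authors' implementation: the function name `modal_attention_fuse`, the parameter matrices `W_v`, `W_s`, `W_h`, the vector `w`, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def modal_attention_fuse(video_feat, sensor_feat, decoder_state, W_v, W_s, W_h, w):
    """Fuse two modality features with attention weights (hypothetical sketch).

    video_feat, sensor_feat: shape (d,), assumed already projected to a common size
    decoder_state:           shape (h,), state of the caption decoder at this step
    W_v, W_s:                shape (k, d), per-modality projections (assumed)
    W_h:                     shape (k, h), decoder-state projection (assumed)
    w:                       shape (k,), scoring vector (assumed)
    """
    feats = np.stack([video_feat, sensor_feat])              # (2, d)
    projs = np.stack([W_v @ video_feat, W_s @ sensor_feat])  # (2, k)
    # Additive attention score per modality, conditioned on the decoder state.
    scores = np.tanh(projs + W_h @ decoder_state) @ w        # (2,)
    weights = softmax(scores)                                # non-negative, sums to 1
    fused = weights @ feats                                  # (d,) convex combination
    return fused, weights
```

Because the weights depend on the decoder state, they would be recomputed at every decoding step, which is one plausible reading of "dynamic" in the title: when the video is blurred or the hands are occluded, such a scheme can shift weight toward the sensor feature.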

Code & Models

Repositories

hitachi-rd-cv/mmac_captions
