A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Jack Hessel; Bo Pang; Zhenhai Zhu; Radu Soricut

arXiv:1910.02930·cs.CL·October 8, 2019

TL;DR

This paper demonstrates that combining ASR tokens with visual features significantly enhances automatic instructional video captioning, especially for fine-grained distinctions, outperforming methods using visual features alone.

Contribution

The paper introduces a joint modeling approach that uses both ASR tokens and visual features to improve instructional video captioning.

Findings

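01 Jointly modeling ASR tokens and visual features outperforms training on either modality alone.

02 Visual features better explain unstated background information.

03 ASR tokens more easily disambiguate fine-grained distinctions (e.g., "add oil" vs. "add olive oil").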

Abstract

Instructional videos receive high traffic on video-sharing platforms, and prior work suggests that providing time-stamped subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance compared to training individually on either modality. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.
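
The "constant prediction" baseline mentioned in the abstract can be illustrated with a minimal sketch. The snippet assumes the baseline always emits the most frequent training caption, whatever the input video, and scores it with a simple unigram-overlap F1. Both choices, the toy captions, and the function names are illustrative assumptions, not the paper's actual evaluation setup.

```python
from collections import Counter


def constant_baseline(train_captions):
    """Return the single most frequent training caption (assumed baseline)."""
    return Counter(train_captions).most_common(1)[0][0]


def unigram_f1(prediction, reference):
    """Token-overlap F1 between two whitespace-tokenized captions.

    This is a stand-in metric chosen for illustration only.
    """
    pred = Counter(prediction.split())
    ref = Counter(reference.split())
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, not taken from the paper.
train = ["add the oil", "add the oil", "heat the pan", "chop the onion"]
test = ["add olive oil", "heat the oil in the pan"]

# The same caption is predicted for every test segment.
constant = constant_baseline(train)
scores = [unigram_f1(constant, ref) for ref in test]
mean_score = sum(scores) / len(scores)
print(constant, round(mean_score, 3))
```

Because instructional captions share many common words, such an input-independent prediction can still earn a nonzero overlap score, which is why it is a useful floor against which to compare models that actually read the video or its ASR tokens.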
