Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

Hassan Akbari; Hamid Palangi; Jianwei Yang; Sudha Rao; Asli Celikyilmaz; Roland Fernandez; Paul Smolensky; Jianfeng Gao; Shih-Fu Chang

arXiv:2011.09530·cs.CV·November 20, 2020·1 cites

TL;DR

This paper introduces a neuro-symbolic model for video captioning that leverages inductive biases and relation learning to produce more accurate and interpretable captions, validated by automatic and human evaluations.

Contribution

The paper proposes a novel multi-modal neuro-symbolic architecture that learns spatial, temporal, and cross-modal relations using dictionary learning and attention, enhancing caption quality and interpretability.

Findings

01. Improved caption quality on two datasets

02. Enhanced grounding and relevance in human evaluations

03. Effective learning of multi-modal relations

Abstract

Neuro-symbolic representations have proved effective in learning structure information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions. We refer to these relations as relative roles and leverage them to make each token role-aware using attention. This results in a more structured and interpretable architecture that incorporates modality-specific inductive biases for the captioning task. Intuitively, the model is able to learn spatial, temporal, and cross-modal relations in a given pair of video and text. The disentanglement achieved by our proposal gives the model more capacity to capture multi-modal structures which result in captions with higher quality for…
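The idea of making each token "role-aware" through attention over a learned dictionary can be illustrated with a minimal sketch. This is not the authors' architecture: the function `role_aware_tokens`, the randomly initialised `role_dictionary` (which would be learned in a real model), the dimensions, and the additive combination of token and role are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def role_aware_tokens(tokens, role_dictionary):
    """Attend from each token over a dictionary of role vectors.

    Hypothetical helper: each token queries the dictionary with scaled
    dot-product attention, and the attended role vector is added to the
    token. The additive combination is an assumption of this sketch.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for tok in tokens:
        scores = [dot(tok, role) / math.sqrt(dim) for role in role_dictionary]
        weights = softmax(scores)
        # Weighted sum of dictionary entries: the token's soft "role".
        role_vec = [
            sum(w * role[j] for w, role in zip(weights, role_dictionary))
            for j in range(dim)
        ]
        outputs.append([t + r for t, r in zip(tok, role_vec)])
        all_weights.append(weights)
    return outputs, all_weights


def random_vectors(n, dim):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
video_tokens = random_vectors(6, 16)     # stand-in for video features
text_tokens = random_vectors(4, 16)      # stand-in for caption token embeddings
role_dictionary = random_vectors(8, 16)  # random here; learned in practice

# Both modalities share one dictionary in this sketch.
role_aware, weights = role_aware_tokens(video_tokens + text_tokens, role_dictionary)
```

Each row of `weights` is a distribution over dictionary entries, so inspecting which entries a token attends to is one way such a design could support interpretability.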

Code & Models

Repositories

hassanhub/R3Transformer (tf, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization