Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Hassan Akbari, Hamid Palangi, Jianwei Yang, Sudha Rao, Asli Celikyilmaz, Roland Fernandez, Paul Smolensky, Jianfeng Gao, Shih-Fu Chang

TL;DR
This paper introduces a neuro-symbolic model for video captioning that leverages inductive biases and relation learning to produce more accurate and interpretable captions, validated by automatic and human evaluations.
Contribution
The paper proposes a novel multi-modal neuro-symbolic architecture that learns spatial, temporal, and cross-modal relations using dictionary learning and attention, enhancing caption quality and interpretability.
Findings
Improved caption quality on two datasets
Enhanced grounding and relevance in human evaluations
Effective learning of multi-modal relations
Abstract
Neuro-symbolic representations have proved effective in learning structure information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions. We refer to these relations as relative roles and leverage them to make each token role-aware using attention. This results in a more structured and interpretable architecture that incorporates modality-specific inductive biases for the captioning task. Intuitively, the model is able to learn spatial, temporal, and cross-modal relations in a given pair of video and text. The disentanglement achieved by our proposal gives the model more capacity to capture multi-modal structures which result in captions with higher quality for…
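Illustrative sketch
The abstract describes making each token role-aware by attending over relations learned with a dictionary-based method. The sketch below is one minimal way to picture that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the names `role_aware_tokens` and `role_dictionary`, the random vectors standing in for learned video and text embeddings, the scaled dot-product scoring, and the additive combination of a token with its attended role vector.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def role_aware_tokens(tokens, role_dictionary):
    """Attend from each token over a dictionary of role vectors.

    Hypothetical helper: each token scores every dictionary atom,
    the softmax weights mix the atoms into a role vector, and the
    role vector is added to the token (the additive mix is an assumption).
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for token in tokens:
        scores = [dot(token, atom) / math.sqrt(dim) for atom in role_dictionary]
        weights = softmax(scores)
        role = [
            sum(w * atom[j] for w, atom in zip(weights, role_dictionary))
            for j in range(dim)
        ]
        outputs.append([t + r for t, r in zip(token, role)])
        all_weights.append(weights)
    return outputs, all_weights


def random_vectors(count, dim):
    """Random stand-ins for learned embeddings (illustration only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
video_tokens = random_vectors(6, 16)     # e.g. frame or region features
text_tokens = random_vectors(4, 16)      # e.g. caption word embeddings
role_dictionary = random_vectors(8, 16)  # shared dictionary of role atoms

# Both modalities attend over the same dictionary in this sketch.
role_aware, weights = role_aware_tokens(video_tokens + text_tokens, role_dictionary)
```

Each row of `weights` is a distribution over dictionary atoms, so it can be inspected to see which role a given video or text token leans on, which is the sense of interpretability this sketch is meant to convey.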
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
