Learning Bidirectional Action-Language Translation with Limited Supervision and Incongruent Input
Ozan Özdemir, Matthias Kerzel, Cornelius Weber, Jae Hee Lee, Muhammad Burhan Hafez, Patrick Bruns, Stefan Wermter

TL;DR
This paper introduces a deep learning framework for bidirectional action-language translation that performs well with limited supervision and handles incongruent multimodal inputs, inspired by human infant learning.
Contribution
It proposes the PTAE model with Transformer-based crossmodal attention, improving translation accuracy under scarce supervision and modeling incongruence effects.
Findings
PTAE outperforms previous models in low-supervision scenarios.
The model's output degrades with conflicting multimodal input, especially with action conflicts.
PTAE's behavior remains plausible when tested with incongruent data.
Abstract
Human infant learning happens during exploration of the environment, by interaction with objects, and by listening to and repeating utterances casually, which is analogous to unsupervised learning. Only occasionally, a learning infant would receive a matching verbal description of an action it is performing, which is similar to supervised learning. Such a learning mechanism can be mimicked with deep learning. We model this weakly supervised learning paradigm using our Paired Gated Autoencoders (PGAE) model, which combines an action and a language autoencoder. After observing a performance drop when reducing the proportion of supervised training, we introduce the Paired Transformed Autoencoders (PTAE) model, using Transformer-based crossmodal attention. PTAE achieves significantly higher accuracy in language-to-action and action-to-language translations, particularly in realistic but…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications
