Generalization in Multimodal Language Learning from Simulation
Aaron Eisermann, Jae Hee Lee, Cornelius Weber, Stefan Wermter

TL;DR
This paper explores how multimodal input data from simulation can enhance the ability of neural networks, specifically LSTMs, to generalize compositionally beyond their training distribution, addressing limitations seen in single-modality models.
Contribution
It introduces a new multimodal dataset from simulation and demonstrates that multimodality significantly improves compositional generalization in neural networks.
Findings
Multimodal input improves generalization where vision alone struggles.
Increasing the number of objects, actions, and color overlaps improves compositional learning.
Simple setups generalize poorly; greater dataset complexity and multimodal input both help.
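To make the notion of compositional generalization in these findings concrete, here is a minimal sketch of a held-out-combination split. It is an illustration only, not the authors' dataset code: the vocabulary lists, the function names `compositional_split` and `covers_all_words`, and the choice to hold out (color, object) pairs are all assumptions made for this example. The idea is that every individual word appears in training, while some combinations are seen only at test time.

```python
from itertools import product

# Hypothetical vocabulary; the paper's actual actions, colors, and objects may differ.
ACTIONS = ["push", "pull", "lift"]
COLORS = ["red", "green", "blue"]
OBJECTS = ["cube", "ball", "cone"]


def compositional_split(actions, colors, objects, held_out_pairs):
    """Send every (action, color, object) triple that contains a held-out
    (color, object) pair to the test set; all other triples go to training."""
    held_out = set(held_out_pairs)
    train, test = [], []
    for action, color, obj in product(actions, colors, objects):
        if (color, obj) in held_out:
            test.append((action, color, obj))
        else:
            train.append((action, color, obj))
    return train, test


def covers_all_words(train, actions, colors, objects):
    """Check that each individual word still occurs somewhere in training,
    so that only the combinations (not the words) are novel at test time."""
    seen_actions = {a for a, _, _ in train}
    seen_colors = {c for _, c, _ in train}
    seen_objects = {o for _, _, o in train}
    return (
        seen_actions == set(actions)
        and seen_colors == set(colors)
        and seen_objects == set(objects)
    )


# Assumed held-out pairs, chosen only for illustration.
train, test = compositional_split(
    ACTIONS, COLORS, OBJECTS, held_out_pairs=[("red", "cube"), ("blue", "ball")]
)
print(len(train), len(test), covers_all_words(train, ACTIONS, COLORS, OBJECTS))
```

Under this kind of split, a model that scores well on `test` has to recombine known words in unseen ways rather than recall training examples. Enlarging the vocabulary lists increases the number of training combinations in which each word occurs, which is one simple way to picture the dataset-diversity effect reported above.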
Abstract
Neural networks can be powerful function approximators, which are able to model high-dimensional feature distributions from a subset of examples drawn from the target distribution. Naturally, they perform well at generalizing within the limits of their target function, but they often fail to generalize outside of the explicitly learned feature space. It is therefore an open research topic whether and how neural network-based architectures can be deployed for systematic reasoning. Many studies have shown evidence for poor generalization, but they often work with abstract data or are limited to single-channel input. Humans, however, learn and interact through a combination of multiple sensory modalities, and rarely rely on just one. To investigate compositional generalization in a multimodal setting, we generate an extensible dataset with multimodal input sequences from simulation. We…
