# Interactive Natural Language Acquisition in a Multi-modal Recurrent Neural Architecture

**Authors:** Stefan Heinrich, Stefan Wermter

arXiv: 1703.08513 · 2018-02-08

## TL;DR

This paper presents a neurocognitively plausible multi-modal recurrent neural architecture enabling a humanoid robot to acquire language through real-world interaction, integrating vision and somatosensation with hierarchical and self-organizing principles.

## Contribution

It introduces a novel multi-modal recurrent neural model inspired by brain mechanisms for embodied language acquisition in robots, combining hierarchical abstraction and self-organization.

## Key findings

- The model learns language production grounded in temporally dynamic somatosensation and vision
- It exhibits hierarchical concept abstraction and concept decomposition
- It integrates multiple modalities and self-organises its latent representations

## Abstract

For the complex human brain that enables us to communicate in natural language, we have gathered a good understanding of the principles underlying language acquisition and processing, knowledge about socio-cultural conditions, and insights into activity patterns in the brain. However, we do not yet understand the behavioural and mechanistic characteristics of natural language, or how mechanisms in the brain allow us to acquire and process language. In bridging the insights from behavioural psychology and neuroscience, the goal of this paper is to contribute a computational understanding of appropriate characteristics that favour language acquisition. Accordingly, we provide concepts and refinements in cognitive modelling regarding principles and mechanisms in the brain, and propose a neurocognitively plausible model for embodied language acquisition from the real-world interaction of a humanoid robot with its environment. In particular, the architecture consists of a continuous-time recurrent neural network in which, for every modality, parts have different leakage characteristics and thus operate on multiple timescales, and in which the higher-level nodes of all modalities are associated into cell assemblies. The model is capable of learning language production grounded in both temporally dynamic somatosensation and vision, and features hierarchical concept abstraction, concept decomposition, multi-modal integration, and self-organisation of latent representations.
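
To make the idea of "different leakage characteristics" concrete, the following is a minimal sketch of a discretised leaky-integrator (continuous-time) recurrent layer in which each unit has its own time constant. It is not taken from the paper: the update rule is the commonly used Euler discretisation of a CTRNN, and the names `ctrnn_step`, `simulate`, the layer sizes, and the time constants `2.0` and `50.0` are illustrative assumptions only.

```python
import math
import random


def ctrnn_step(u, y_prev, weights, bias, tau):
    """One Euler step of a leaky-integrator recurrent layer.

    u       -- internal states of the units at the previous step
    y_prev  -- unit activations at the previous step
    weights -- weights[i][j] connects unit j to unit i
    bias    -- per-unit bias
    tau     -- per-unit time constant (assumed >= 1); larger means slower
    """
    new_u = []
    for i in range(len(u)):
        drive = bias[i] + sum(w * y for w, y in zip(weights[i], y_prev))
        leak = 1.0 / tau[i]
        # A unit with large tau keeps most of its old state (slow context);
        # a unit with small tau follows its input closely (fast dynamics).
        new_u.append((1.0 - leak) * u[i] + leak * drive)
    new_y = [math.tanh(v) for v in new_u]
    return new_u, new_y


def simulate(steps=50, n_fast=4, n_slow=2, seed=0):
    """Run a small layer with fast and slow units (all values hypothetical)."""
    rng = random.Random(seed)
    n = n_fast + n_slow
    tau = [2.0] * n_fast + [50.0] * n_slow  # assumed timescales
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    bias = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    u, y = [0.0] * n, [0.0] * n
    history = []
    for _ in range(steps):
        u, y = ctrnn_step(u, y, weights, bias, tau)
        history.append(list(y))
    return history


if __name__ == "__main__":
    for t, y in enumerate(simulate(steps=5)):
        print(t, [round(v, 3) for v in y])
```

In this sketch the units with the large time constant change slowly and could act as a context for the fast units. In an architecture such as the one described, one layer of this kind per modality would be a plausible reading, with the slowest units of each modality linked together, but that wiring is not shown here.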

---
Source: https://tomesphere.com/paper/1703.08513