# Autonomous agents: Augmenting visual information with raw audio data

**Authors:** Enoch Solomon

PMC · DOI: 10.1371/journal.pone.0318372 · PLOS One · 2025-05-23

## TL;DR

This paper shows that adding raw audio to visual data helps AI agents learn faster and perform better in games.

## Contribution

The paper proposes adding raw audio samples to the visual state representation of reinforcement learning agents, and evaluates the idea with DQN and PPO in ViZDoom and Unity environments.

## Key findings

- Agents using both audio and visual data achieved higher rewards than those using only visual data.
- Agents learned faster when raw audio was included in the state representation.
- Agent behavior improved in both the ViZDoom and Unity environments.

## Abstract

In the realm of game playing, deep reinforcement learning predominantly relies on visual input to map states to actions. The visual data extracted from the game environment serves as the primary foundation for state representation in reinforcement learning agents. However, humans leverage additional sensory inputs, such as audio cues, which play a pivotal role in perception and decision-making. Therefore, incorporating raw audio along with visual information shows potential for offering valuable insights to reinforcement learning agents. This study advocates for the integration of raw audio samples as complementary information to visual data in state representation. By using raw audio with visual cues, our objective is to enrich the decision-making process of the agent at each stage. Experimental evaluations were conducted employing Deep Q Networks (DQN) and Proximal Policy Optimization (PPO) algorithms within ViZDoom and Unity reinforcement learning environments. The results of our experiments reveal that augmenting visual information with raw audio samples yields superior rewards and expedites the learning rate compared to relying solely on visual data. Additionally, the findings suggest that considering both visual and audio features enhances the agent’s behavior, a trend observed across Unity and ViZDoom environments. This study underscores the potential advantages of incorporating multisensory information, particularly raw audio, into the state representation of reinforcement learning agents. Such insights contribute to advancing our understanding of how agents perceive and engage with their environments, ultimately enhancing performance in complex gaming scenarios.
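To make the idea of an audio-augmented state concrete, here is a minimal sketch in plain Python. It is not the author's architecture: this summary does not describe how the paper fuses the two modalities, so the simple concatenation below is an assumption chosen for illustration. The function names, the 8-bit grayscale frame, and the 16-bit PCM audio format are all hypothetical.

```python
from typing import List, Sequence


def normalize_frame(frame: Sequence[Sequence[int]]) -> List[float]:
    """Flatten a grayscale frame (rows of 0-255 pixels) into floats in [0, 1]."""
    return [pixel / 255.0 for row in frame for pixel in row]


def normalize_audio(samples: Sequence[int], bit_depth: int = 16) -> List[float]:
    """Scale signed PCM samples into the range [-1, 1)."""
    full_scale = float(2 ** (bit_depth - 1))
    return [sample / full_scale for sample in samples]


def build_state(
    frame: Sequence[Sequence[int]], samples: Sequence[int]
) -> List[float]:
    """Concatenate visual and audio features into a single state vector.

    Assumption: plain concatenation stands in for whatever fusion the
    paper actually uses (for example, separate encoders per modality).
    """
    return normalize_frame(frame) + normalize_audio(samples)


# Toy observation: a 2x2 frame plus three raw audio samples for one step.
frame = [[0, 255], [51, 102]]
samples = [0, 16384, -32768]
state = build_state(frame, samples)
print(len(state))
```

In a DQN or PPO setup, a vector like `state` would take the place of the vision-only observation fed to the agent's network at each step. Real frames and audio buffers are far larger, so in practice each modality would usually pass through its own feature extractor before fusion.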

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12101691/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12101691/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12101691/full.md

---
Source: https://tomesphere.com/paper/PMC12101691