# Speaker-story mapping as a method to evaluate audiovisual scene analysis in a virtual classroom scenario

**Authors:** Stephan Fremerey, Carolin Breuer, Larissa Leist, Maria Klatte, Janina Fels, Alexander Raake

PMC · DOI: 10.3389/fpsyg.2025.1520630 · 2025-06-10

## TL;DR

This study tests whether audiovisual immersive virtual environments can be used to evaluate cognitive performance in classroom-like settings, using a speaker-story mapping task.

## Contribution

It introduces a new method using immersive virtual environments to assess audiovisual scene analysis in educational scenarios.

## Key findings

- Performance in audiovisual scene analysis was significantly affected by the type of audio and visual representation used.
- Task performance decreased when using diotic audio or CGI-based visuals compared to binaural audio and 360° video.
- Mental load and user behavior varied across experimental conditions, but simulator sickness and presence remained unaffected.

## Abstract

This study explores how audiovisual immersive virtual environments (IVEs) can assess cognitive performance in classroom-like settings, addressing limitations in simpler acoustic and visual representations. It examines the potential of a test paradigm using speaker-story mapping, called “audiovisual scene analysis (AV-SA),” originally developed for virtual reality (VR) hearing research, as a method to evaluate audiovisual scene analysis in a virtual classroom scenario. Factors affecting acoustic and visual scene representation were varied to investigate their impact on audiovisual scene analysis. Two acoustic representations were used: a simple “diotic” presentation, where the same signal is presented to both ears, and a dynamically live-rendered binaural synthesis (“binaural”). Two visual representations were used: 360°/omnidirectional video with intrinsic lip-sync and computer-generated imagery (CGI) without lip-sync. Three subjective experiments were conducted with different combinations of the acoustic and visual conditions. The first experiment, involving 36 participants, used 360° video with “binaural” audio. The second experiment, with 24 participants, combined 360° video with “diotic” audio. The third experiment, with 34 participants, used the CGI environment with “binaural” audio. Each environment presented 20 different speakers in a classroom-like circle of 20 chairs, with the number of simultaneously active speakers ranging from 2 to 10, while the remaining speakers kept silent and were always shown. During the experiments, the subjects' task was to correctly map the stories' topics to the corresponding speakers. The primary dependent variable was the number of correct assignments during a fixed period of 2 min. Each trial was followed by two questionnaires on mental load. In addition, before and/or after the experiments, subjects completed questionnaires about simulator sickness, noise sensitivity, and presence.
Results indicate that the experimental condition significantly influenced task performance, mental load, and user behavior but did not affect perceived simulator sickness and presence. Performance decreased when comparing the 360° video and “binaural” audio experiment with either the experiment using “diotic” audio and 360° video or the experiment using “binaural” audio with CGI-based visuals, showing the usefulness of the test method in investigating influences on cognitive audiovisual scene analysis performance.
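
As an illustration of the primary dependent variable described in the abstract (the number of correct speaker-to-topic assignments in a trial), the following is a minimal sketch in Python. It is not the authors' implementation: the function name, the seat-number keys, and the example topics are all hypothetical and chosen only to show the counting logic.

```python
from typing import Dict


def count_correct_assignments(truth: Dict[int, str], response: Dict[int, str]) -> int:
    """Count speakers whose assigned story topic matches the ground truth.

    truth:    seat number -> topic actually told by that (active) speaker
    response: seat number -> topic the participant assigned to that seat
    Both structures are assumptions made for this sketch.
    """
    return sum(
        1
        for seat, topic in response.items()
        if truth.get(seat) == topic  # silent seats are absent from `truth`
    )


# Hypothetical trial: 3 of the 20 seats hold active speakers.
truth = {2: "holiday", 7: "football", 15: "cooking"}
# The participant maps one topic correctly, one to the wrong active
# speaker, and one to a silent speaker (seat 11).
response = {2: "holiday", 7: "cooking", 11: "football"}

score = count_correct_assignments(truth, response)
```

In this sketch an assignment to a silent speaker simply counts as incorrect; whether and how the study penalizes such assignments is not stated in the abstract.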

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12185469/full.md

---
Source: https://tomesphere.com/paper/PMC12185469