# The Cross‐Linguistic Coordination of Overt Attention and Speech Production as Evidence for a Language of Vision

**Authors:** Moreno I. Coco, Eunice G. Fernandes, Manabu Arai, Frank Keller

PMC · DOI: 10.1111/cogs.70185 · Cognitive Science · 2026-02-23

## TL;DR

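In eye-tracking experiments where participants described visual scenes in English, Portuguese, and Japanese, the semantic similarity of their sentences predicted the similarity of their scan patterns, even across languages. The authors take this as evidence that a universal language of vision guides scene viewing during description, rather than the structure of the spoken language alone.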

## Contribution

The study provides empirical evidence for a universal language of vision that underlies cross-modal coordination in speech and visual attention.

## Key findings

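- Semantic similarity between sentences strongly predicts the similarity of the associated scan patterns in all three languages (English, Portuguese, Japanese), even across scenes and between sentences in different languages.
- Syntactic effects on eye movements are secondary and transient: they appear only in within-language, within-scene comparisons and only during the early planning phase of the utterance.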
- A universal language of vision supports stable semantic coordination across modalities.

## Abstract

A central question in cognition is how representations are integrated across different modalities, such as language and vision. One prominent hypothesis posits the existence of an abstract, prelinguistic “language of vision” as a representational system that organizes meaning compositionally, enabling cross‐modal integration. This hypothesis predicts that the language of vision operates universally, independent of linguistic surface features such as word order. We conducted eye‐tracking experiments where participants described visual scenes in English, Portuguese, and Japanese. By analyzing spoken descriptions alongside eye‐movement sequences divided into planning and articulation phases, we demonstrate that semantic similarity between sentences strongly predicts the similarity of associated scan patterns in all three languages, even across scenes and between sentences in different languages. In contrast, the effect of syntactic constraints was secondary and transient: it was restricted to within‐language and within‐scene comparisons, and temporally confined to the early planning phase of the utterance. Our findings support an interactive account of cross‐modal coordination in which a universal language of vision provides stable semantic scaffolding, while syntax serves as a local constraint, primarily active during message linearization.
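To make the core analysis concrete, the following is a minimal sketch of a pairwise similarity comparison, not the authors' pipeline. It assumes that scan patterns are coded as sequences of fixated object labels and compared with a normalized longest common subsequence, and that sentence meaning is approximated by the Jaccard overlap of hand-annotated, language-independent concept labels. Both measures, the `trials` data, and every function name are hypothetical illustrations; this summary does not specify the measures used in the paper, and the split into planning and articulation phases is not modeled here.

```python
from itertools import combinations
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def scan_similarity(a, b):
    """LCS length normalized by the longer sequence, giving a value in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def concept_similarity(a, b):
    """Jaccard overlap of two concept sets (a crude stand-in for semantic similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical toy trials: the concept labels are language-independent
# annotations, so sentences in different languages can be compared directly.
trials = [
    {"lang": "en", "concepts": {"man", "pet", "dog"},
     "scan": ["man", "dog", "man", "dog"]},
    {"lang": "pt", "concepts": {"man", "pet", "dog"},
     "scan": ["man", "dog", "dog"]},
    {"lang": "ja", "concepts": {"woman", "read", "book"},
     "scan": ["woman", "book", "woman"]},
]

# One similarity value per pair of trials, for each modality.
pairs = list(combinations(trials, 2))
sentence_sims = [concept_similarity(p["concepts"], q["concepts"]) for p, q in pairs]
scan_sims = [scan_similarity(p["scan"], q["scan"]) for p, q in pairs]

# A positive correlation means that trials with similar meanings
# also tend to have similar scan patterns.
r = pearson(sentence_sims, scan_sims)
```

With only three invented trials the correlation is trivially high, so the value itself carries no evidential weight. The point is the shape of the analysis: one similarity score per trial pair in each modality, then a test of whether the two sets of scores covary. A real analysis would also need to account for repeated participants and scenes, which a plain correlation ignores.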

## Full-text entities

- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12930141/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12930141/full.md

## References

135 references, with the full list in the complete paper: https://tomesphere.com/paper/PMC12930141/full.md

---
Source: https://tomesphere.com/paper/PMC12930141