A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

Joseph Bingham

arXiv:2602.19562·cs.AI·February 24, 2026

Open Access

TL;DR

This paper presents a computational framework that aligns natural language descriptions with visual data, demonstrating near-human performance on a reference game benchmark by integrating perceptual features with linguistic processing.

Contribution

The work introduces a novel multimodal alignment model combining perceptual similarity measures with linguistic pragmatics, achieving human-competitive referential grounding performance.

Findings

01. Achieves 41.66% accuracy in identifying target objects from single expressions.

02. Requires 65% fewer utterances than humans to establish stable mappings.

03. Provides a scalable, cognitively plausible approach to cross-modal reference understanding.

Abstract

Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic…
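For readers unfamiliar with the Universal Quality Index named in the abstract, the following is a minimal sketch of the standard global (single-window) form of the index, Q = 4·cov(x, y)·mean(x)·mean(y) / ((var(x) + var(y))·(mean(x)² + mean(y)²)). It is not the paper's implementation: the function name `uqi`, the choice to treat the whole image as one window, and the `nan` return for the undefined case are all assumptions made for illustration, and the SIFT alignment step is omitted.

```python
import numpy as np


def uqi(x, y):
    """Global Universal Quality Index between two equally sized grayscale images.

    Illustrative sketch only: the whole image is treated as a single window,
    whereas practical uses often average the index over sliding windows.
    Returns a value in [-1, 1]; 1 means the inputs are identical.
    """
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    if x.shape != y.shape:
        raise ValueError("images must have the same number of pixels")

    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mean_x) * (y - mean_y)).mean()

    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0.0:
        # Assumption: report the undefined case (flat or all-zero inputs) as NaN.
        return float("nan")
    return float(4.0 * cov_xy * mean_x * mean_y / denom)


if __name__ == "__main__":
    a = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(uqi(a, a))        # identical images
    print(uqi(a, 2.0 * a))  # same structure, different brightness and contrast
```

The index factors into correlation, luminance, and contrast terms, so an image compared with a brightened or contrast-stretched copy of itself scores below 1 even though the two are perfectly correlated.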

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Action Observation and Synchronization