A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data
Joseph Bingham

TL;DR
This paper presents a computational framework that aligns natural language descriptions with visual data, demonstrating near-human performance on a reference game benchmark by integrating perceptual features with linguistic processing.
Contribution
The work introduces a novel multimodal alignment model combining perceptual similarity measures with linguistic pragmatics, achieving human-competitive referential grounding performance.
Findings
Achieves 41.66% accuracy in identifying target objects from single expressions.
Requires 65% fewer utterances than humans to establish stable mappings.
Provides a scalable, cognitively plausible approach to cross-modal reference understanding.
Abstract
Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic…
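The abstract names the Universal Quality Index (UQI) as one of the two similarity components. As a rough illustration of what that measure computes, the sketch below implements the standard single-window form of UQI for two equally sized grayscale images, Q = 4·cov(x, y)·mean(x)·mean(y) / ((var(x) + var(y))·(mean(x)² + mean(y)²)). The function name `universal_quality_index` is hypothetical, and the excerpt does not say how the paper windows the images, preprocesses them, or combines UQI with SIFT alignment, so none of that is modeled here.

```python
import numpy as np


def universal_quality_index(image_a, image_b):
    """Global (single-window) UQI between two equally sized grayscale images.

    Illustrative sketch only: the paper's windowing and its combination
    with SIFT alignment are not described in this excerpt.
    """
    a = np.asarray(image_a, dtype=np.float64)
    b = np.asarray(image_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")

    x = a.ravel()
    y = b.ravel()
    if x.size < 2:
        raise ValueError("need at least two pixels")

    mean_x = x.mean()
    mean_y = y.mean()

    # Unbiased sample variances and covariance.
    var_x = x.var(ddof=1)
    var_y = y.var(ddof=1)
    cov_xy = ((x - mean_x) * (y - mean_y)).sum() / (x.size - 1)

    denominator = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denominator == 0:
        # Happens for two constant images, or when both means are zero.
        raise ValueError("UQI is undefined for these inputs")

    return 4.0 * cov_xy * mean_x * mean_y / denominator


if __name__ == "__main__":
    reference = np.arange(1, 17).reshape(4, 4)
    print(universal_quality_index(reference, reference))
    print(universal_quality_index(reference, reference[::-1, ::-1]))
```

The index lies in [-1, 1] and is close to 1 when the two images agree in mean luminance, contrast, and structure. In practice UQI is usually averaged over sliding windows rather than computed once over the whole image; the global form is used here only to keep the sketch short.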
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Action Observation and Synchronization
