Do Neural Network Cross-Modal Mappings Really Bridge Modalities?
Guillem Collell, Marie-Francine Moens

TL;DR
This paper investigates whether neural network mappings used in cross-modal applications make the neighborhood structure of the predicted vectors resemble that of the target vectors. It finds that the predicted vectors often retain the neighborhood structure of the input more than that of the target, and that even untrained networks largely preserve input neighborhoods.
Contribution
The paper introduces a new similarity measure and an experimental framework for evaluating neighborhood preservation in cross-modal neural mappings, highlighting limitations of current approaches.
Findings
Predicted vectors resemble input vectors' neighborhood structure more than target vectors'
Untrained networks do not significantly alter the neighborhood structure of input vectors
Neural mappings often fail to effectively bridge semantic neighborhoods across modalities
Abstract
Feed-forward networks are widely used in cross-modal applications to bridge modalities by mapping distributed vectors of one modality to the other, or to a shared space. The predicted vectors are then used to perform e.g., retrieval or labeling. Thus, the success of the whole system relies on the ability of the mapping to make the neighborhood structure (i.e., the pairwise similarities) of the predicted vectors akin to that of the target vectors. However, whether this is achieved has not been investigated yet. Here, we propose a new similarity measure and two ad hoc experiments to shed light on this issue. In three cross-modal benchmarks we learn a large number of language-to-vision and vision-to-language neural network mappings (up to five layers) using a rich diversity of image and text features and loss functions. Our results reveal that, surprisingly, the neighborhood structure of…
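To make "neighborhood structure" concrete, here is a minimal sketch of one way to compare the nearest neighbors of two paired sets of vectors. It is an illustration under stated assumptions, not necessarily the measure defined in the paper: the function names `knn_indices` and `mean_neighbor_overlap`, the use of cosine similarity, the value of `k`, and the random stand-in data are all hypothetical choices made for this example.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (by cosine similarity) of each row of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)  # a vector is not counted as its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mean_neighbor_overlap(A, B, k=10):
    """Average fraction of k nearest neighbors shared by paired rows of A and B.

    A and B may have different dimensionality, but row i of A must correspond
    to row i of B (e.g., the predicted and target vectors of the same item).
    """
    na, nb = knn_indices(A, k), knn_indices(B, k)
    shared = [len(set(a) & set(b)) for a, b in zip(na, nb)]
    return float(np.mean(shared)) / k

# Synthetic stand-ins (assumption: random data, not the paper's benchmarks).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))              # input-modality vectors
Y = rng.normal(size=(200, 30))              # target-modality vectors
W = rng.normal(size=(50, 30)) / np.sqrt(50) # hypothetical untrained layer
pred = np.tanh(X @ W)                       # "predicted" vectors f(X)

print("overlap(pred, input): ", mean_neighbor_overlap(pred, X))
print("overlap(pred, target):", mean_neighbor_overlap(pred, Y))
```

A score of 1 means every item has identical nearest neighbors in both spaces, and a score near 0 means almost none are shared. Comparing the score of the predicted vectors against the inputs with their score against the targets is the kind of contrast the abstract describes.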
