Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Yeming Chen, Siyu Zhang, Yaoru Sun, Weijian Liang, Haoran Wang

TL;DR
This paper introduces ASH-Nets, a novel hierarchical model combining artificial and spiking neural networks to improve multimodal vision-language representations through semantic encoding and contrastive learning.
Contribution
The work presents a flexible hierarchical network integrating ANNs and SNNs, with novel semantic encoders and a pre-training method for enhanced vision-language task performance.
Findings
Achieves competitive results on multiple VL benchmarks.
Improves semantic encoding with discrete and continuous latent variables.
Enhances efficiency through contrastive learning and hard sample augmentation.
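As a rough illustration of the contrastive learning mentioned above, the sketch below computes a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is a generic formulation, not the ASH-Nets objective: the function names `l2_normalize` and `info_nce`, the temperature of 0.07, and the toy embeddings are all assumptions made for this example, and the paper's hard sample augmentation is not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy batch: matched pairs should score a lower loss than mismatched ones.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(images, aligned_texts)
shuffled_loss = info_nce(images, shuffled_texts)
```

The loss treats every other item in the batch as a negative, so it falls as matched pairs move closer together and mismatched pairs move apart.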
Abstract
With the success of self-supervised learning, multimodal foundation models have rapidly adapted to a wide range of downstream tasks driven by vision and language (VL) pretraining. State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets. However, bridging the semantic gap between the two modalities remains a non-negligible challenge for VL tasks. In this work, we propose an efficient computation framework for multimodal alignment by introducing a novel visual semantic module to further improve the performance of VL tasks. Specifically, we propose a flexible model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which combines the complementary advantages of artificial neural networks (ANNs) and spiking neural networks (SNNs) to enrich visual semantic representations. In particular, a visual concrete encoder and a semantic abstract…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Genomics and Phylogenetic Studies
Methods: Contrastive Learning
