Audio-Image Cross-Modal Retrieval with Onomatopoeic Images

Keisuke Imoto; Yamato Kojima; Takao Tsuchiya

arXiv:2605.17509 · eess.AS · May 19, 2026

TL;DR

This paper proposes a new framework for cross-modal retrieval between onomatopoeic images and sounds, introducing a dataset and training modality-specific projection heads to improve retrieval accuracy.

Contribution

The paper introduces the Multimodal Image-Audio Onomatopoeia dataset (MIAO) and trains modality-specific projection heads on pretrained image and audio embeddings, an approach that outperforms zero-shot CLIP and CLAP baselines in cross-modal retrieval.

Findings

01. The proposed method significantly outperforms the zero-shot CLIP and CLAP baselines.

02. Adapting the pretrained representations improves retrieval accuracy in both directions (image-to-audio and audio-to-image).

03. The constructed MIAO dataset covers 50 sound event classes of onomatopoeic images and sounds.

Abstract

Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impressions through letter shapes, strokes, layouts, and decorative patterns. However, cross-modal retrieval between onomatopoeic images and general sounds has been largely unexplored. This paper thus introduces a bidirectional retrieval framework between onomatopoeic images and the corresponding sound clips. Instead of directly comparing embeddings extracted from pretrained image and audio encoders, we train modality-specific projection heads that re-align the embeddings for visual onomatopoeia and the corresponding sounds. We then construct the Multimodal Image-Audio Onomatopoeia dataset (MIAO), which contains paired…
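The projection-head idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear heads, the embedding sizes (512 and 1024), the shared dimension (128), the temperature, and the symmetric contrastive (InfoNCE-style) loss are all assumptions chosen for illustration, and random vectors stand in for the frozen encoder outputs. The helper names (`project`, `symmetric_contrastive_loss`, `retrieve`) are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def project(emb, weight):
    """Hypothetical linear projection head followed by L2 normalization."""
    return l2_normalize(emb @ weight)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img_z, aud_z, temperature=0.07):
    """Assumed objective: matched image/audio pairs sit on the diagonal."""
    logits = img_z @ aud_z.T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> audio
    loss_a2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> image
    return 0.5 * (loss_i2a + loss_a2i)

def retrieve(query_z, gallery_z, k=1):
    """Return indices of the top-k gallery items for each query."""
    sims = query_z @ gallery_z.T
    return np.argsort(-sims, axis=1)[:, :k]

# Placeholder data: random vectors stand in for frozen encoder embeddings.
rng = np.random.default_rng(0)
n_pairs, d_img, d_aud, d_shared = 8, 512, 1024, 128  # assumed sizes
img_emb = rng.normal(size=(n_pairs, d_img))
aud_emb = rng.normal(size=(n_pairs, d_aud))

# One head per modality; in training only these weights would be updated.
W_img = rng.normal(size=(d_img, d_shared)) / np.sqrt(d_img)
W_aud = rng.normal(size=(d_aud, d_shared)) / np.sqrt(d_aud)

img_z = project(img_emb, W_img)
aud_z = project(aud_emb, W_aud)

loss = symmetric_contrastive_loss(img_z, aud_z)
img_to_aud = retrieve(img_z, aud_z, k=3)  # sounds ranked for each image
aud_to_img = retrieve(aud_z, img_z, k=3)  # images ranked for each sound
```

Under these assumptions, both retrieval directions reuse the same projected embeddings, so a single pair of heads serves image-to-audio and audio-to-image queries. With untrained random weights the rankings are arbitrary; the sketch only shows the data flow.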

Code & Models

Datasets

KeisukeImoto/MIAO
dataset · 1.1k downloads