Audio-Image Cross-Modal Retrieval with Onomatopoeic Images
Keisuke Imoto, Yamato Kojima, Takao Tsuchiya

TL;DR
This paper proposes a new framework for cross-modal retrieval between onomatopoeic images and sounds, introducing a dataset and training modality-specific projection heads to improve retrieval accuracy.
Contribution
It introduces the Multimodal Image-Audio Onomatopoeia dataset (MIAO) and a novel training approach that outperforms zero-shot baselines in cross-modal retrieval tasks.
Findings
The proposed method significantly outperforms zero-shot CLIP and CLAP baselines.
Adapting pretrained representations improves retrieval accuracy in both directions (image-to-sound and sound-to-image).
The MIAO dataset was constructed with 50 sound event classes of paired onomatopoeic images and sounds.
Abstract
Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impressions through letter shapes, strokes, layouts, and decorative patterns. However, cross-modal retrieval between onomatopoeic images and general sounds has been largely unexplored. This paper thus introduces a bidirectional retrieval framework between onomatopoeic images and the corresponding sound clips. Instead of directly comparing embeddings extracted from pretrained image and audio encoders, we train modality-specific projection heads that re-align the embeddings for visual onomatopoeia and corresponding sounds. We then construct the Multimodal Image-Audio Onomatopoeia dataset (MIAO), which contains paired…
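
The projection-head idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not specify the head architecture, the embedding sizes, or the training objective, so the linear heads, the toy dimensions, the temperature of 0.07, and the symmetric contrastive (InfoNCE-style) loss below are all assumptions. Random vectors stand in for the frozen image and audio encoder outputs. No training loop is included; the sketch only computes the loss that such a loop might minimise and the rankings used for retrieval in both directions.

```python
import math
import random


def make_head(in_dim, out_dim, rng):
    """Hypothetical linear projection head: an out_dim x in_dim weight matrix."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(head, x):
    """Apply a head to one embedding and L2-normalise the result."""
    y = [sum(w * v for w, v in zip(row, x)) for row in head]
    norm = math.sqrt(sum(v * v for v in y)) or 1.0
    return [v / norm for v in y]


def similarity_matrix(img_z, aud_z):
    """Cosine similarities; rows are images, columns are audio clips."""
    return [[sum(a * b for a, b in zip(zi, za)) for za in aud_z] for zi in img_z]


def rank(scores):
    """Candidate indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Assumed objective: average of image-to-audio and audio-to-image
    cross-entropy, treating image i and audio i as the matching pair."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))


rng = random.Random(0)
# Stand-ins for frozen encoder outputs: 4 pairs, toy sizes (assumed).
img_emb = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
aud_emb = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(4)]

# One head per modality, mapping into a shared 5-dimensional space.
img_head = make_head(8, 5, rng)
aud_head = make_head(6, 5, rng)
img_z = [project(img_head, x) for x in img_emb]
aud_z = [project(aud_head, x) for x in aud_emb]

sim = similarity_matrix(img_z, aud_z)
loss = symmetric_contrastive_loss(sim)

# Bidirectional retrieval: rank audio per image, and images per audio clip.
image_to_audio = [rank(row) for row in sim]
audio_to_image = [rank([sim[i][j] for i in range(len(sim))])
                  for j in range(len(sim[0]))]
```

Keeping the pretrained encoders fixed and learning only the two heads is one way to read "re-align the embeddings": the heads are the only parameters that would be updated, so the same similarity matrix serves both retrieval directions.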
