CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation

Hyunwoo Oh, SeungJu Cha, Kwanyoung Lee, Si-Woo Kim, and Dong-Jin Kim

arXiv:2507.18750 · cs.MM · July 28, 2025

Open Access

TL;DR

CatchPhrase is an audio-to-image generation framework that uses large language models and audio captioning models to turn weak class labels into enriched semantic prompts, improving the alignment between audio inputs and generated images.

Contribution

The paper combines prompt mining (EXPrompt Mining), multi-modal filtering and retrieval (EXPrompt Selector), and a lightweight mapping network that adapts pre-trained text-to-image generation models to audio input.

Findings

01 Improves semantic alignment in audio-to-image generation.

02 Enhances image quality and relevance in experiments.

03 Mitigates issues caused by homographs and auditory illusions.

Abstract

We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple…
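
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that audio clips, class labels, and candidate prompts have already been embedded into a shared vector space by some multi-modal encoder, and it stands in plain cosine similarity for the paper's filtering and retrieval. The function names `cosine` and `select_prompt`, the `min_class_sim` threshold, and the toy three-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prompt(audio_emb, candidates, class_emb=None, min_class_sim=0.0):
    """Pick the candidate prompt whose embedding best matches the audio clip.

    candidates: list of (prompt_text, prompt_embedding) pairs.
    class_emb, min_class_sim: optional class-level filter (an assumption of
    this sketch): prompts whose similarity to the class embedding falls
    below the threshold are dropped before instance-level retrieval.
    Returns (best_prompt, score), or (None, None) if no prompt survives.
    """
    kept = candidates
    if class_emb is not None:
        # Class-level filtering: discard prompts that drift from the class.
        kept = [(t, e) for t, e in candidates if cosine(e, class_emb) >= min_class_sim]
    if not kept:
        return None, None
    # Instance-level retrieval: rank the surviving prompts against this clip.
    score, text = max((cosine(audio_emb, e), t) for t, e in kept)
    return text, score


# Toy embeddings for the homograph "bat" (made-up numbers, not real encoder output).
audio = [0.9, 0.1, 0.0]
candidates = [
    ("a baseball bat hitting a ball", [1.0, 0.0, 0.0]),
    ("a bat flying out of a cave", [0.0, 1.0, 0.0]),
]

best, score = select_prompt(audio, candidates)
print(best)  # the baseball prompt, since the toy audio vector lies closest to it
```

In this toy setup the same weak label ("bat") yields two candidate prompts, and the audio embedding decides between them, which is the kind of ambiguity the abstract attributes to homographs.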

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies