Audio Retrieval for Multimodal Design Documents: A New Dataset and Algorithms

Prachi Singh; Srikrishna Karanam; Sumit Shekhar

arXiv:2302.14757·cs.MM·March 1, 2023

TL;DR

This paper introduces MELON, a large-scale dataset of multimodal design documents with paired audio, and proposes a novel cross-attention algorithm for retrieving relevant audio, improving multimodal content matching and accessibility.

Contribution

The paper presents a new dataset, MELON, and a novel multimodal cross-attention retrieval algorithm for matching audio with design documents involving images and text.

Findings

1. Our method outperforms existing state-of-the-art approaches.
2. The dataset provides a new benchmark for multimodal audio retrieval.
3. The approach enhances accessibility for visually impaired users.

Tables

Table 1: Melodic-Design vs. other datasets. I, A, and T correspond to image, audio, and text, respectively. “Various” for Melodic-Design covers illustrations, vectors, templates, and background designs. C, R, G, and MR denote classification, retrieval, generation, and music retrieval, respectively.

Dataset	Visual content	Modalities	#Images / #Audio	Tasks
VGGSound [2]	Action vid.	I+A	199k	C
Audio Set [6]	Human vid.	I+A	2.1M	R+G
MUGEN [5]	Game vid.	I+A+T	233k	R+G
Shuttersong [3]	Images	I+A+T	17k	MR+C
IMEMNet [7]	Images	I+A	25k / 1.8k	MR
MELON [Proposed]	Various	I+A+T	488k / 7.7k	MR+C

Table 2: List of moods/themes used for training and evaluation.

adventure	advertising	drama	funny	love
fun	commercial	dramatic	groovy	romantic
game	corporate	movie	happy	nature
holiday	ambiental	dream	hopeful	summer
horror	calm	emotional	motivational	retro
space	relaxing	heavy	melodic	background
sport	soft	melancholic	children
upbeat	mellow	sad	christmas

Table 3: Proposed MMCAR vs. baselines.

Method	Med r	Recall@1	Recall@5	Recall@10	Recall@15	Recall@20
Val
JTAV [1]	17.1	0.12	0.30	0.35	0.46	0.62
Wav2CLIP [10]	9.2	0.12	0.37	0.66	0.83	0.92
MMCAR [Ours]	3.8	0.42	0.80	0.92	0.96	0.98
MMCAR*	7.0	0.23	0.59	0.77	0.87	0.93
Test
JTAV [1]	16.9	0.12	0.30	0.35	0.45	0.62
Wav2CLIP [10]	9.2	0.11	0.39	0.67	0.83	0.91
MMCAR [Ours]	3.9	0.42	0.79	0.92	0.96	0.98
MMCAR*	7.2	0.22	0.56	0.76	0.89	0.93

Equations

$$\begin{aligned} C_{xy} &= \mathbf{x}\mathbf{y}^{T} \in \mathcal{R}^{d\times d} \ \text{and} \ C_{yx} = \mathbf{y}\mathbf{x}^{T} \\ S_{x} &= \sigma(C_{xy} * W + B) \in \mathcal{R}^{d\times d} \\ \hat{x} &= \operatorname{diag}(S_{x} C_{xy}^{T}) \end{aligned}$$

$$\begin{aligned} S_{y} &= \sigma(C_{yx} * W + B) \in \mathcal{R}^{d\times d} \\ \hat{y} &= \operatorname{diag}(S_{y} C_{yx}^{T}) \end{aligned}$$

$$\begin{aligned} \mathbf{u}_{xy} &= \hat{x} \oplus \hat{y} \in \mathcal{R}^{2d\times 1} \\ \mathbf{u}_{\text{all}} &= \mathbf{u}_{it} \oplus \mathbf{u}_{ia} \oplus \mathbf{u}_{ta} \in \mathcal{R}^{6d\times 1} \end{aligned}$$

$$L = \frac{1}{B}\sum_{i=1}^{B}\left[ z\log(\hat{z}) + (1-z)\log(1-\hat{z}) \right]$$

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Diverse Musicological Studies

Full text

Abstract

We consider and propose a new problem of retrieving audio files relevant to multimodal design document inputs comprising both textual elements and visual imagery, e.g., birthday/greeting cards. In addition to enhancing user experience, integrating audio that matches the theme/style of these inputs also helps improve the accessibility of these documents (e.g., visually impaired people can listen to the audio instead).

While recent work in audio retrieval exists, these methods and datasets are targeted explicitly towards natural images. Our problem instead considers multimodal design documents (created by users with creative software), which differ substantially from natural photographs. To this end, our first contribution is collecting and curating a new large-scale dataset called Melodic-Design (or MELON), comprising design documents representing various styles, themes, templates, illustrations, etc., paired with music audio. Given our paired image-text-audio dataset, our next contribution is a novel multimodal cross-attention audio retrieval (MMCAR) algorithm that enables training neural networks to learn a common shared feature space across image, text, and audio dimensions. Using these learned features, we demonstrate that our method outperforms existing state-of-the-art methods and establish a new reference benchmark for the research community on our new dataset.

Index Terms— Music Retrieval, Multimodal processing, cross attention.

1 Introduction

With the increasing proliferation of on-demand web/mobile-based graphic design software (e.g., https://www.adobe.com/express, https://www.canva.com, https://www.sketch.com), designing creative documents for any occasion, e.g., greeting cards, event invitations/flyers, and social media infographics, has become very easy. In most cases, such design documents tend to be multimodal, i.e., they comprise some visual imagery and some textual elements (see Fig 1 (right)). For such documents, adding a further modality in the form of relevant audio/music files will not only enhance the consumption experience of users but also improve document accessibility for visually impaired users. To this end, our first contribution is the consideration and proposal of a new problem involving the retrieval of relevant audio files given a multimodal design document. While much past work [1, 2] has focused on audio retrieval for natural images, there has been no work on the kind of design documents referred to above, and this paper takes a step towards bridging this gap in the literature (see Fig 1).

As the problem is unexplored, the existing datasets [3, 2] for audio retrieval contain only natural images and are not suitable for our proposed problem. To this end, our second contribution is the collection and curation of a new paired design-audio dataset that comprises multimodal design documents, scraped from publicly available data on Adobe Stock (https://stock.adobe.com/), paired with relevant audio files collected from the MTG-Jamendo [4] repository. With $\approx 500k$ design documents paired with over $7.5k$ audio files, this is a first-of-its-kind dataset that we believe will help advance research in multimodal design understanding.

Finally, our third contribution is a novel multimodal cross-attention algorithm that enables training neural networks to learn a shared representation among the image, text, and audio modalities present in our problem setting. In particular, given paired design-audio samples, we extract individual modality features and learn per-pair as well as overall weights to learn a unified design-audio embedding. With extensive experiments on our proposed new dataset, we demonstrate our algorithm substantially outperforms the existing state-of-the-art audio retrieval methods.

2 Related Work

As noted in Section 1, work in multimodal audio retrieval has mainly focused on natural images. In particular, in Image2Song [3], the Shuttersong dataset was used to map images and song lyrics to the same feature space, which was then used for downstream tasks like retrieval. Using the same dataset, Liang et al. [1] proposed a method to jointly learn a feature space by fusing visual and acoustic features. Similarly, datasets like VGGSound [2] and MUGEN [5], while having a video modality, focus on either naturally occurring human actions or videos generated by game engines. In contrast, our contribution is a new dataset focused solely on multimodal creative design documents, such as greeting cards and infographics, that are commonly created using creative design software. Our large-scale dataset, comprising hundreds of thousands of design documents paired with audio files, provides a challenging testbed for advancing retrieval research.

3 Melodic Design (MELON) - A new dataset

As discussed in Section 1, given a multimodal design document like the ones shown in Fig 1, our problem is one of retrieving a short list of audio files that go well with the various elements of the input. For example, the first row/second column in Fig 1 shows an adventure-themed design document containing both images and text. Different elements include the background image/color, text fields (e.g., “Mountains”), and decorative elements and shapes. Note that each of these elements forms a layer in the design document (e.g., background image is the background layer, and the textual greeting is the foreground layer), giving a multi-layered multimodal design document. As noted in Section 1, and also from Figure 1, existing datasets focus solely on natural images, whereas our problem entails design documents. To bridge this clear gap in the literature, we collect and curate a new dataset comprising pairs of multimodal design documents and corresponding audio files, and we call our dataset “MELOdic desigN” (MELON).

3.1 Collecting Raw Dataset Samples

We use the publicly available MTG-Jamendo [4] database, which encompasses a variety of mood/theme categories, instruments, and genres, as our source of audio files. We use various time-frequency features like intensity, timbre, pitch, tempo, and rhythm [8] to identify the mood of an audio file. For example, the pitch varies from very high to very low as we move from the “happy” to the “sad” mood. Similarly, the intensity and tempo of the “upbeat” mood are very high, whereas those of “calm” are very low (see Fig. 2). For mapping/associating audio to design documents below, we use music files corresponding to 50 mood categories in MTG-Jamendo. Since MTG-Jamendo also has audio files labelled with multiple mood categories, we retain only those data samples labelled with a single mood for simplicity.

We use publicly available data from Adobe Stock as our source for collecting multimodal design documents comprising image and text content. To scrape images, we built a software utility that can query Stock with any mood category along with data types as part of the input. For instance, one such query would involve fetching illustrations, vectors, templates, and background images for the adventure mood. By restricting the search-page limit to 10, we obtain about $10,000$ images across all the above document types for every mood category. Note that Adobe Stock also provides image metadata which contains manually generated captions describing the image elements in detail.

3.2 Establishing Correspondence & Dataset Statistics

In our dataset, images and text are already paired since each downloaded image comes with a ground-truth caption. We use the common mood categories across the MTG music dataset and our proposed design document dataset to form image-caption-audio pairs. Specifically, given a mood category, we first extract the CLIP [9] features for an image-text sample. For each audio file tagged with the same mood, we extract Wav2CLIP [10] embeddings and compute cosine similarities between the image-audio embeddings (denoted $s(i,a)$ ) and text-audio embeddings (denoted $s(t,a)$ ). We use the weighted sum $\lambda_{1}s(i,a)+\lambda_{2}s(t,a)$ to retain the audio files corresponding to the top-N similarity scores. We repeat this process for all images in each mood category to curate our audio-design dataset.
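The pairing step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function names `cosine_sim` and `top_n_audio`, the default weights `lam1 = lam2 = 0.5`, the value `n=10`, and the random stand-in embeddings are all hypothetical. In the paper the embeddings come from CLIP and Wav2CLIP, and the values of $\lambda_{1}$, $\lambda_{2}$, and N are not given in this section.

```python
import numpy as np

def cosine_sim(vec, mat):
    """Cosine similarity between one vector (d,) and each row of mat (n, d)."""
    vec = vec / np.linalg.norm(vec)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return mat @ vec

def top_n_audio(img_emb, txt_emb, audio_embs, n=10, lam1=0.5, lam2=0.5):
    """Indices and scores of the n audio files with the highest
    lam1 * s(i, a) + lam2 * s(t, a), best first."""
    score = (lam1 * cosine_sim(img_emb, audio_embs)
             + lam2 * cosine_sim(txt_emb, audio_embs))
    order = np.argsort(-score)[:n]
    return order, score[order]

# Random stand-ins for the CLIP (image, text) and Wav2CLIP (audio) embeddings.
rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(100, 512))
img_emb = rng.normal(size=512)
txt_emb = rng.normal(size=512)
idx, scores = top_n_audio(img_emb, txt_emb, audio_embs, n=10)
```

In practice this would be run once per image-caption sample, restricted to the audio files tagged with the same mood.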

Our proposed MELON dataset consists of 488,510 images with corresponding captions and 7,737 music audio files belonging to 50 moods/themes. Each mood category consists of $\approx 10k$ images. A distribution plot of the audio samples per category, as well as per-mood word clouds that demonstrate the diversity and variability of our dataset, are provided in the supplementary material (link to supplementary document: https://shorturl.at/hsDU5; we plan to make materials publicly available at this link). In Table 1, we quantitatively compare the proposed MELON dataset with existing datasets. One can note that while existing datasets are focused on natural images and videos, our proposed dataset is unique in that it contains creative illustrations, vectors, templates, and background designs with complete descriptions of the image content as part of the caption. With $\approx 500k$ images, our dataset will help the community build robust models for both music retrieval (MR) and classification (C) tasks.

4 Multi-modal Cross-attention Audio Retrieval (MMCAR)

Here, we describe our proposed algorithm for retrieving audio files given an input design document. Our key algorithmic novelty is a multi-modal cross-attention module that operates on feature vectors from all three input modalities (image, text, and audio) and learns a common representation space for the downstream retrieval task.

Figure 3 visually summarizes our proposed algorithm. During training, given input triplets from our MELON dataset, we first use per-modality embedding extractors to compute feature vectors $\mathbf{i}$, $\mathbf{t}$, and $\mathbf{a}$ for the image, text, and audio modalities, respectively. For $\mathbf{i}$ and $\mathbf{t}$, we use the CLIP [9] model to obtain a 512-dimensional embedding for each. For $\mathbf{a}$, we train a ResNet-18 model (for audio classification) on the publicly available VGGSound dataset.

Given the $\mathbf{i}$, $\mathbf{t}$, and $\mathbf{a}$ embeddings, we propose a multi-modal cross-attention operation to learn a unified multi-modal design-audio embedding. Given the three feature vectors, we perform pairwise cross-attention pooling, taking any two modalities $\mathbf{x},\mathbf{y}\in\{\mathbf{i},\mathbf{t},\mathbf{a}\}$ such that $\mathbf{x}\in\mathcal{R}^{d}$ is the query and $\mathbf{y}\in\mathcal{R}^{d}$ is the key, resulting in an output $\hat{x}$. Similarly, by interchanging $\mathbf{x}$ and $\mathbf{y}$, we compute the output $\hat{y}$. These two outputs are then used to compute a common embedding $\mathbf{u}_{xy}$ for this particular pair of $x$ and $y$. We repeat this for all possible pairs, $(x=i,y=t)$, $(x=i,y=a)$, and $(x=t,y=a)$, and use the corresponding outputs to obtain the proposed unified embedding $\mathbf{u}_{\text{all}}$ as:

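$$\begin{aligned} C_{xy} &= \mathbf{x}\mathbf{y}^{T} \in \mathcal{R}^{d\times d} \ \text{and} \ C_{yx} = \mathbf{y}\mathbf{x}^{T} \\ S_{x} &= \sigma(C_{xy} * W + B) \in \mathcal{R}^{d\times d}, \quad \hat{x} = \operatorname{diag}(S_{x} C_{xy}^{T}) \\ S_{y} &= \sigma(C_{yx} * W + B) \in \mathcal{R}^{d\times d}, \quad \hat{y} = \operatorname{diag}(S_{y} C_{yx}^{T}) \\ \mathbf{u}_{xy} &= \hat{x} \oplus \hat{y} \in \mathcal{R}^{2d\times 1} \\ \mathbf{u}_{\text{all}} &= \mathbf{u}_{it} \oplus \mathbf{u}_{ia} \oplus \mathbf{u}_{ta} \in \mathcal{R}^{6d\times 1} \end{aligned}$$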
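A minimal NumPy sketch of the pairwise cross-attention pooling may help make the shapes concrete. It follows the paper's listed equations, $C_{xy}=\mathbf{x}\mathbf{y}^{T}$, $S_{x}=\sigma(C_{xy} * W + B)$, $\hat{x}=\operatorname{diag}(S_{x}C_{xy}^{T})$, and the concatenations $\mathbf{u}_{xy}=\hat{x}\oplus\hat{y}$ and $\mathbf{u}_{\text{all}}=\mathbf{u}_{it}\oplus\mathbf{u}_{ia}\oplus\mathbf{u}_{ta}$. Several points are assumptions made only for illustration: `*` is read as an element-wise product, $\sigma$ as the logistic sigmoid, $W$ and $B$ as $d\times d$ parameters shared across directions and pairs, and all function names are hypothetical. The parameters here are random rather than learned.

```python
import numpy as np

def sigmoid(m):
    return 1.0 / (1.0 + np.exp(-m))

def cross_attend(x, y, W, B):
    """x_hat = diag(S_x C_xy^T), with C_xy = x y^T and
    S_x = sigmoid(C_xy * W + B); `*` is element-wise (assumption)."""
    C = np.outer(x, y)                   # (d, d)
    S = sigmoid(C * W + B)               # (d, d) attention weights
    return np.einsum("ij,ij->i", S, C)   # diagonal of S @ C.T, shape (d,)

def pair_embedding(x, y, W, B):
    """u_xy: x_hat concatenated with y_hat, shape (2d,)."""
    return np.concatenate([cross_attend(x, y, W, B),
                           cross_attend(y, x, W, B)])

def unified_embedding(i, t, a, W, B):
    """u_all: u_it, u_ia, u_ta concatenated, shape (6d,)."""
    return np.concatenate([pair_embedding(i, t, W, B),
                           pair_embedding(i, a, W, B),
                           pair_embedding(t, a, W, B)])

d = 8                                     # toy dimension; CLIP features are 512-d
rng = np.random.default_rng(0)
i_vec, t_vec, a_vec = rng.normal(size=(3, d))
W, B = rng.normal(size=(2, d, d))         # random stand-ins for learned weights
u_all = unified_embedding(i_vec, t_vec, a_vec, W, B)
```

The `einsum` call computes only the diagonal of $S_{x}C_{xy}^{T}$, which avoids forming the full $d\times d$ product.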

This unified embedding $\mathbf{u}_{\text{all}}$ is then passed to a fully connected neural network unit which generates, with a sigmoid operation, a scalar score $\hat{z}\in[0,1]$. We compare this with the ground-truth label $z$ ($z=1$ if the input is a correct pair and $z=0$ otherwise), resulting in a binary cross-entropy training objective

$$L = \frac{1}{B}\sum_{i=1}^{B}\left[ z\log(\hat{z}) + (1-z)\log(1-\hat{z}) \right]$$

where B is the batch size. During inference, given a design document input and a database of $n$ audio samples from which to retrieve relevant files, our model computes the similarity scores $\hat{z}_{1},\hat{z}_{2},...,\hat{z}_{n}$ for the input image-text pair with all the $n$ audio samples. Given these scores, we pick the audio files corresponding to the top- $k$ highest scores as the retrieval results (see Fig 3 (right)).
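The training objective and the inference-time ranking described above can be illustrated with a short NumPy sketch. The function names, the clipping constant `eps`, and the toy scores are assumptions for illustration only. The sketch writes binary cross-entropy with its conventional leading minus sign, so that lower values indicate better predictions.

```python
import numpy as np

def bce_loss(z, z_hat, eps=1e-12):
    """Mean binary cross-entropy over a batch of labels z and scores z_hat."""
    z = np.asarray(z, dtype=float)
    z_hat = np.clip(np.asarray(z_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(z * np.log(z_hat) + (1.0 - z) * np.log(1.0 - z_hat))

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-np.asarray(scores))[:k]

# Training: confident, correct scores give a lower loss than uninformative ones.
labels = [1, 0, 1, 0]
confident = bce_loss(labels, [0.9, 0.1, 0.8, 0.2])
uncertain = bce_loss(labels, [0.5, 0.5, 0.5, 0.5])

# Inference: rank n = 5 candidate audio files by their scores z_hat_1..z_hat_n.
audio_scores = np.array([0.10, 0.85, 0.40, 0.92, 0.05])
best = top_k(audio_scores, k=3)
```

With these toy scores, `best` holds the indices of the three highest-scoring audio files in descending order of score.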

5 Experiments and Results

Since our proposed problem, data, and algorithm are centered around audio retrieval given image-text design documents, the closest baselines in the literature are JTAV [1] and Wav2CLIP [10]. While JTAV operates on an image, its caption, and the textual lyrics of an audio track to learn features, Wav2CLIP uses the CLIP [9] image encoder and an audio autoencoder to map audio and image features close together.

To benchmark the performance of these algorithms and compare them to our proposed method on our new dataset, we use rank-based evaluation metrics proposed in prior work [1]. In particular, we use the Med r $\in[1,M]$ metric, which represents the median rank of the ground-truth audio in the retrieved list, where $M$ is the maximum number of classes considered. A lower value of Med r indicates better performance. We also use recall@k ($k\in\{1,5,10,15,20\}$), which is the fraction of test cases in which the ground-truth audio is retrieved among the top-k ranked items; higher values indicate better performance.
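The two metrics can be computed from a matrix of scores as in the sketch below. It is an illustrative implementation under assumptions: the function names are hypothetical, ties are broken in favour of the ground truth, and the toy score matrix is made up. It is not taken from the paper or from [1].

```python
import numpy as np

def ranks_of_truth(score_matrix, truth):
    """1-based rank of the ground-truth item in each row (higher score = better)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    truth = np.asarray(truth)
    true_scores = score_matrix[np.arange(len(truth)), truth]
    # rank = 1 + number of items scored strictly higher than the ground truth
    return 1 + (score_matrix > true_scores[:, None]).sum(axis=1)

def median_rank(ranks):
    """Med r: median of the ground-truth ranks (lower is better)."""
    return float(np.median(ranks))

def recall_at_k(ranks, k):
    """Fraction of test cases whose ground truth is ranked within the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

# Four test cases scored against three candidate classes.
scores = [[0.9, 0.1, 0.3],
          [0.2, 0.5, 0.8],
          [0.4, 0.6, 0.1],
          [0.3, 0.2, 0.7]]
truth = [0, 1, 0, 2]
ranks = ranks_of_truth(scores, truth)   # ranks are 1, 2, 2, 1
med_r = median_rank(ranks)              # 1.5
r_at_1 = recall_at_k(ranks, 1)          # 0.5
```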

For training and evaluating all models, we selected 38 of the 50 moods in our dataset based on maximum uniqueness (see Table 2 for the list). We construct the training split using $60\%$ of the image-caption pairs and audio samples. Each triplet comprises the image, caption, and audio along with a $1/0$ label indicating whether the audio mapping is correct. To form positive triplets, we select $10$ audio samples for each design based on mood, as discussed in Section 3.2. The negative triplets are formed by randomly selecting $5$ moods different from the mood of the image and selecting $2$ audio samples from each of them. The remaining $40\%$ of the data is split equally to obtain validation and test splits with $20\%$ of the samples each.
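The triplet construction described above (10 positives, plus 2 negatives from each of 5 other moods) could look like the following sketch. The function name `build_triplets`, the file names, and the toy mood list are hypothetical, and the positives are taken as a simple slice instead of the similarity-based top-10 selection of Section 3.2.

```python
import random

def build_triplets(design, design_mood, audio_by_mood, positives, rng,
                   n_neg_moods=5, n_neg_per_mood=2):
    """Return (design, audio, label) triplets: label 1 for the given
    positives, label 0 for audio sampled from other moods."""
    triplets = [(design, audio, 1) for audio in positives]
    other_moods = [m for m in audio_by_mood if m != design_mood]
    for mood in rng.sample(other_moods, n_neg_moods):
        for audio in rng.sample(audio_by_mood[mood], n_neg_per_mood):
            triplets.append((design, audio, 0))
    return triplets

rng = random.Random(0)
moods = ["happy", "sad", "calm", "upbeat", "horror", "retro", "space"]
audio_by_mood = {m: [f"{m}_{j}.mp3" for j in range(12)] for m in moods}
positives = audio_by_mood["happy"][:10]   # stand-in for the top-10 by similarity
triplets = build_triplets("card_001.png", "happy", audio_by_mood, positives, rng)
```

Each design therefore contributes 10 positive and 10 negative triplets, which keeps the two labels balanced.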

We next present our evaluation results. Since we seek to retrieve the best matched audio file in terms of mood, we first compute a mean feature vector for each mood from our audio repository. Then, given the feature vector for an input design document, we generate a score vector $\boldsymbol{\hat{z}}\in\mathcal{R}^{M}$ , where $M=38$ as noted above, for the mean features of all the audio files. We then compute the recall@k and Med r metrics using the reference mood label.
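The per-mood mean feature vectors mentioned above can be computed as in this small sketch. The function name `mood_prototypes` and the toy two-dimensional features are assumptions for illustration. In the paper, these mean features are then scored against the design document by the trained model to produce $\boldsymbol{\hat{z}}$; that scoring step is not reproduced here.

```python
import numpy as np

def mood_prototypes(audio_feats, audio_moods, n_moods):
    """Mean audio feature vector per mood, shape (n_moods, d)."""
    audio_feats = np.asarray(audio_feats, dtype=float)
    audio_moods = np.asarray(audio_moods)
    return np.stack([audio_feats[audio_moods == m].mean(axis=0)
                     for m in range(n_moods)])

# Four audio files with 2-d features; the first two belong to mood 0,
# the last two to mood 1.
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
mood_ids = [0, 0, 1, 1]
protos = mood_prototypes(feats, mood_ids, n_moods=2)   # rows: [2, 0] and [0, 3]
```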

In Table 3, we compare the performance of our proposed MMCAR algorithm with the JTAV and Wav2CLIP baselines on both the val and test sets. One can note that our proposed MMCAR gives the lowest Med r ($3.9$ on the test set) and the highest recall at all ranks, e.g., $79\%$ recall at k=5 on the test set compared to JTAV’s $30\%$ and Wav2CLIP’s $39\%$, a relative improvement of more than $100\%$. For a fairer comparison, we also reimplemented our MMCAR with Wav2CLIP’s features (denoted MMCAR* in the table). While MMCAR* performs worse than our original, end-to-end-learned MMCAR model, it is still substantially better than the baseline Wav2CLIP approach. This provides evidence for our multimodal cross-attention module’s discriminative capabilities in the shared image-text-audio feature space.
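As a quick check of the relative-improvement claim, the test-set Recall@5 values in Table 3 (0.79 for MMCAR, 0.39 for Wav2CLIP, 0.30 for JTAV) give:

```latex
\[
\frac{0.79 - 0.39}{0.39} \approx 1.03 \;(\approx 103\%), \qquad
\frac{0.79 - 0.30}{0.30} \approx 1.63 \;(\approx 163\%)
\]
```

Both ratios exceed 1, i.e., the relative gain over each baseline at k=5 is more than $100\%$.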

To provide additional evidence, we show t-SNE plots of the learned design document embeddings in Figure 5, where the proposed MMCAR approach yields a clearer clustering of the features by mood than the baseline approaches. In Figure 4, we compare MMCAR’s confusion matrix with those of the baselines for a random selection of 14 moods, where one can note that, while baseline predictions are biased towards a few specific moods, the proposed method yields a close-to-diagonal matrix, as expected.

6 Summary

We considered and proposed a new problem of retrieving relevant audio files given multimodal design documents as input. In the absence of any relevant datasets in the literature, we built and presented a first-of-its-kind multimodal design-audio dataset called MELON, comprising hundreds of thousands of design files with mapped audio files. We then proposed a multimodal cross-attention algorithm that enables training neural networks to learn a joint image-text-audio feature space for design documents, and used it to retrieve relevant audio files for a given design input at test time. We benchmarked our algorithm against the existing state of the art on our new dataset and hope that this will spur further research in this area.

7 ACKNOWLEDGEMENTS

The authors would like to thank Dr. Sriram Ganapathy of LEAP Lab, Indian Institute of Science, Bangalore, for his valuable input and help in offering the required resources to run the experiments.

Bibliography

[1] Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin-Mao Wei, and Zhenglu Yang, “JTAV: Jointly learning social media content representation by fusing textual, acoustic, and visual features,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1269–1280.
[2] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman, “Vggsound: A large-scale audio-visual dataset,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 721–725.
[3] Xuelong Li, Di Hu, and Xiaoqiang Lu, “Image2song: Song retrieval via bridging image content and lyric words,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5649–5658.
[4] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra, “The mtg-jamendo dataset for automatic music tagging,” in ICML, 2019.
[5] Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Isabelle Hu, and Devi Parikh, “Mugen: A playground for video-audio-text multimodal understanding and generation,” arXiv preprint arXiv:2204.08058, 2022.
[6] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[7] Sicheng Zhao, Yaxian Li, Xingxu Yao, Weizhi Nie, Pengfei Xu, Jufeng Yang, and Kurt Keutzer, “Emotion-based end-to-end matching between image and music in valence-arousal space,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2945–2954.
[8] Aathreya S Bhat, VS Amith, Namrata S Prasad, and D Murali Mohan, “An efficient classification algorithm for music mood detection in western and hindi music using audio feature extraction,” in Fifth International Conference on Signal and Image Processing. IEEE, 2014, pp. 359–364.