Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network

Xin Wang; Bo Wu; Yun Ye; Yueqi Zhong

arXiv:1907.11496·cs.CV·August 23, 2019

TL;DR

This paper introduces a multi-layered comparison network for predicting and diagnosing outfit compatibility, leveraging hierarchical features and pairwise similarities, supported by a new dataset, Polyvore-T, demonstrating improved accuracy and interpretability.

Contribution

The paper presents a novel end-to-end framework that predicts and diagnoses outfit compatibility using multi-layered feature comparison and gradient-based diagnosis, along with a new dataset Polyvore-T.

Findings

1. Outperforms prior methods in compatibility prediction accuracy.
2. Provides effective diagnosis of incompatible factors.
3. Achieves better interpretability through hierarchical feature analysis.

Tables

Table 1(a). Dataset statistics before and after data cleaning

Stage	Split	Top	Bottom	Shoe	Bag	Accessory	Items	Outfits
Before	Train	-	-	-	-	-	114806	17316
	Val	-	-	-	-	-	9070	1497
	Test	-	-	-	-	-	18604	3076
After	Train	13764	14849	15268	12640	12093	68614	16176
	Val	962	1052	1124	948	823	4904	1196
	Test	2000	2153	2314	1994	1712	10173	2463

Table 1(b). Example categories in each type group

Type	Category
Top	Jackets, Sweaters, Coats …
Bottom	Dresses, Skinny Jeans, Shorts …
Shoe	Pumps, Sandals, Ankle Booties …
Bag	Tote Bags, Clutches, Handbags …
Accessory	Earrings, Necklaces, Sunglasses …

Table 2. Outfit compatibility prediction AUC and FITB accuracy on the Polyvore-T dataset.

Method	AUC(%)	FITB(%)
Pooling(Li et al., 2017)	$88.35 \pm 0.26$	$57.28 \pm 0.31$
Concatenation(Tangseng et al., 2018)	$83.40 \pm 0.48$	$52.91 \pm 0.59$
Self-Attention (Wang et al., 2018)	$79.65 \pm 0.68$	$48.60 \pm 0.70$
CSN(Vasileva et al., 2018)	$84.90 \pm 0.52$	$57.06 \pm 1.70$
BiLSTM(Han et al., 2017)	$74.44 \pm 0.95$	$45.41 \pm 0.40$
BiLSTM+VSE(Han et al., 2017)	$74.82 \pm 0.63$	$46.02 \pm 0.62$
Ours	$91.90 \pm 0.40$	$64.35 \pm 0.92$

Table 3. Ablation study of each part of the MCN framework, including the Comparison Module (CM), Visual Semantic Embedding (VSE) and Projected Embedding (PE).

Method	AUC(%)	FITB(%)	#Param
Pooling(Li et al., 2017)	$88.35 \pm 0.26$	$57.28 \pm 0.31$	$10^{6}$
Concatenation(Tangseng et al., 2018)	$83.40 \pm 0.48$	$52.91 \pm 0.59$	$10^{6}$
CM	$87.06 \pm 1.08$	$60.07 \pm 0.32$	$10^{2}$
CM + VSE	$88.52 \pm 0.94$	$61.04 \pm 0.81$	$10^{2}$
CM + PE	$90.16 \pm 0.81$	$63.20 \pm 0.79$	$10^{3}$
CM + VSE + PE	$90.70 \pm 0.54$	$62.47 \pm 0.19$	$10^{3}$

Table 4. Results with different numbers of MLP layers for learning the overall compatibility from all pairwise similarities.

Method	AUC(%)	FITB(%)
CSN(Vasileva et al., 2018)	$84.90 \pm 0.52$	$57.06 \pm 1.70$
CM + Avg	$85.28 \pm 0.54$	$55.62 \pm 0.89$
CM + Linear	$88.58 \pm 0.85$	$60.91 \pm 1.19$
CM + 2FC	$90.70 \pm 0.54$	$62.47 \pm 0.19$
CM + 3FC	$88.53 \pm 0.60$	$61.40 \pm 0.75$

Table 5. Prediction performance with representations from different layers.

Method	AUC(%)	FITB(%)
Layer 4	$90.70 \pm 0.54$	$62.47 \pm 0.19$
Layer 4+3	$90.53 \pm 0.69$	$64.14 \pm 1.06$
Layer 4+3+2	$90.92 \pm 0.53$	$64.53 \pm 0.58$
Layer 4+3+2+1	$91.90 \pm 0.40$	$64.35 \pm 0.92$

Equations

$$R=\begin{bmatrix}r_{11}&r_{12}&\dots&r_{1N}\\ r_{21}&r_{22}&\dots&r_{2N}\\ \vdots&\vdots&\ddots&\vdots\\ r_{N1}&r_{N2}&\dots&r_{NN}\end{bmatrix} \tag{1}$$

$$s=W_{2}\,\mathrm{ReLU}\left(W_{1}[R^{1};R^{2};\dots;R^{K}]+b\right) \tag{2}$$

$$s\approx W^{\dagger}[R^{1};R^{2};\dots;R^{K}]+b \tag{3}$$

$$w_{ij}^{k}=\frac{\partial s}{\partial r_{ij}^{k}}\Bigr|_{[R_{0}^{1};R_{0}^{2};\dots;R_{0}^{K}]} \tag{4}$$

$$\omega_{q}=\sum_{1\leq k\leq K}\;\sum_{i=q,\,j\neq q}w_{ij}^{k} \tag{5}$$

$$\mathcal{L}_{clf}=-\left[y\cdot\log\sigma(s)+(1-y)\cdot\log(1-\sigma(s))\right] \tag{6}$$

$$f(x_{i},x_{j})=d\left(P^{i\to(i,j)}x_{i},\,P^{j\to(i,j)}x_{j}\right) \tag{7}$$

$$P^{i\to(i,j)}x_{i}=\mathrm{ReLU}\left(x_{i}\otimes m_{(i,j)}\right) \tag{8}$$

$$\mathcal{L}_{mask}(m)=\|m\|_{1},\qquad\mathcal{L}_{emb}(x)=\|x\|_{2} \tag{9}$$

$$\mathcal{L}_{vse}(v,u;W_{T},W_{I})=\sum_{u}\sum_{k}\max\left(0,\,m-d(u,v)+d(u,v_{k})\right)+\sum_{v}\sum_{k}\max\left(0,\,m-d(v,u)+d(v,u_{k})\right) \tag{10}$$

$$\mathcal{L}_{total}=\mathcal{L}_{clf}+\lambda_{1}\mathcal{L}_{emb}+\lambda_{2}\mathcal{L}_{mask}+\lambda_{3}\mathcal{L}_{vse} \tag{11}$$

Code & Models

Repositories

WangXin93/fashion_compatibility_mcn
PyTorch (official implementation)

Full text

Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network

Xin Wang (ORCID 0000-0002-4315-1867), Donghua University & JD AI Research, Shanghai, China

Bo Wu, Columbia University & JD AI Research, New York, USA

Yun Ye, JD AI Research, Beijing, China

Yueqi Zhong, Donghua University, Shanghai, China

Abstract.

Existing works on fashion outfit compatibility focus on predicting the overall compatibility of a set of fashion items from their information in different modalities. However, few works explore how to explain the prediction, which limits the persuasiveness and effectiveness of the model. In this work, we propose an approach that not only predicts but also diagnoses outfit compatibility. We introduce an end-to-end framework for this goal with two key features: (1) The overall compatibility is learned from all type-specified pairwise similarities between items, and the backpropagation gradients are used to diagnose the incompatible factors. (2) We leverage the hierarchy of the CNN and compare features at different layers to take into account compatibility in different aspects, from the low level (such as color and texture) to the high level (such as style). To support the proposed method, we build a new type-specified outfit dataset named Polyvore-T based on the Polyvore dataset. We compare our method with the prior state of the art on two tasks: outfit compatibility prediction and fill-in-the-blank. Experiments show that our approach has advantages in both prediction performance and diagnosis ability.

Outfit Compatibility; Fashion Recommendation; Deep Learning

Published in: Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), October 21–25, 2019, Nice, France. DOI: 10.1145/3343031.3350909. ISBN: 978-1-4503-6889-6/19/10. CCS concepts: Computing methodologies, Visual content-based indexing and retrieval; Computing methodologies, Neural networks; Applied computing, Online shopping.

1. Introduction

Outfit compatibility refers to whether multiple fashion items look good when worn together. Learning outfit compatibility can help consumers compose satisfactory outfits and help enterprises analyze the market. Many works have used multimedia information to answer questions such as “Is this outfit compatible?” (Li et al., 2017; Vasileva et al., 2018; Tangseng et al., 2018) and “How to generate a compatible outfit?” (Han et al., 2017; Nakamura and Goto, 2018). However, they do not explain their prediction and generation results, i.e., they do not answer “Why is the outfit compatible?”.

Outfit diagnosis means pointing out the incompatible factors in a fashion outfit. It provides not only the compatibility prediction but also, as an explanation, the factors that lead to incompatibility. There are several reasons to diagnose an outfit: 1) For researchers, it helps to understand the essence of compatibility. 2) For consumers, the explanations make the prediction and generation results more convincing. 3) For companies and designers, it provides hints for adjusting an incompatible outfit into a more compatible and popular one. What does the diagnosis result look like? In this work, it has a format like “the outfit is incompatible due to the incompatible color of the trousers and shoes”. As illustrated in Figure 1, imagine a person who is given an outfit to evaluate: he or she would compare each item with the others, and the outfit compatibility is a holistic feeling about all of these comparisons. The primitive unit in this thinking process is a single pairwise comparison; a single item on its own has no compatibility problem. In addition, fashion compatibility is evaluated in different aspects ranging from the low level (such as color and texture) to the high level (such as style). Zou et al. (Zou et al., 2016) have shown that low-level features may directly determine the aesthetic feeling of fashion.

Outfit compatibility prediction is a binary classification problem. To compute the compatibility of an outfit with a variable number of fashion items, previous works explored pooling (Li et al., 2017), concatenation (Tangseng et al., 2018) or an LSTM (Han et al., 2017) to consume the CNN features and output a score. Even though deeply extracted features can achieve impressive performance, it is hard to explain the meaning of each feature dimension. Unlike an ordinary classification problem, the compatibility problem does not depend on the feature contents themselves but lies in the comparisons between features. Besides, previous works use only the CNN features of the last layer, which contain strong semantic information. As mentioned before, low-level features are of great importance since they not only provide different aspects for diagnosis but also affect the compatibility.

In this work, we introduce an end-to-end framework called Multi-Layered Comparison Network (MCN) that takes both prediction and diagnosis into consideration. First, the overall compatibility is treated as a holistic feeling formed after comparing each fashion item with the others, so we introduce the comparison module, which takes the item feature vectors as input and outputs all enumerated pairwise similarities between them. The compatibility score is learned from all of these similarities with a non-linear model. To diagnose the outfit, we use the backpropagation gradient of each input similarity to approximate its importance for the incompatibility (Simonyan et al., 2013; Springenberg et al., 2014). The similarity with the largest gradient value points to the most critical problem. Second, to construct features in different aspects, we leverage the hierarchical architecture of the CNN, which enables the model to learn semantically stronger information as the layers go deeper. We construct comparison modules at different layers of the backend CNN. In the early layers, the comparisons are more concerned with low-level features (such as color and shape); in the later layers, they are more concerned with high-level features (such as category and style).

To implement the comparison module effectively, we follow the ideas of projected embedding (Chen et al., 2018) and type-aware embedding (Vasileva et al., 2018). To leverage both the image and text information of fashion items, we adopt visual semantic embedding (Kiros et al., 2014) to learn a common representation between them. We provide a cleaner outfit dataset named Polyvore-T for the experiments, which is constructed from the dataset of (Han et al., 2017) by adding type labels and removing items that are not relevant to fashion. Our approach can also be generalized to other data sources. For example, to handle fashion blogs with richer textual components, named entity recognition can first be used to extract the descriptions of each part of an outfit, and these descriptions can then be fed to our framework. We show that the diagnosis results of our method correspond to human cognition of outfit compatibility. The prediction performance is tested on two fashion tasks, outfit compatibility prediction and fill-in-the-blank, on which our approach outperforms previous works. In summary, the main contributions of this work include:

• We propose to diagnose the compatibility of the outfit, which is implemented by using the gradient values to approximate the importance of input similarities.

• We propose to incorporate comparison modules at different layers of a convolutional neural network to account for different levels of semantic information.

• We build a cleaner Polyvore-T dataset for experiments and show that our framework outperforms other baselines in outfit compatibility prediction and fill-in-the-blank tasks.

2. Related Work

2.1. Fashion Recognition and Understanding

Fashion recognition is the precursor and basis of fashion compatibility learning. A significant amount of research in the deep learning community has advanced this area. Kiapour et al. (Kiapour et al., 2014) use a supervised method to learn five different fashion styles with the support of human pose estimation. Liu et al. (Liu et al., 2016) propose a large-scale fashion dataset called DeepFashion and a deep network that jointly learns fashion categories, attributes, and landmarks. Simo-Serra et al. (Simo-Serra and Ishikawa, 2016) propose to learn fashion embeddings using weakly labeled data from massive online product descriptions. Lee et al. (Lee et al., 2017) leverage the context in fashion outfits and use a skip-gram model to learn a better fashion representation. The recognition of fashion is also affected by context information such as the city (Chang et al., 2017), location (Zhang et al., 2017), scenery (Simo-serra et al., 2017) and trend (Ling et al., 2019). For unsupervised learning, Iwata et al. and Hsiao et al. (Iwata et al., 2011; Hsiao and Grauman, 2017) use topic models to map fashion attributes to several fashion styles by treating fashion attributes as words, outfits as documents and styles as topics. Al-Halah et al. (Al-Halah et al., 2017) introduce a non-negative matrix factorization method to project deeply extracted features into a fashion style space.

2.2. Visual Compatibility Learning

Even though compatibility is a subjective aesthetic standard, massive user feedback such as co-view and co-purchase records (McAuley et al., 2015; Veit et al., 2015) or proposals from the fashion community (Han et al., 2017) can be used to learn the rules of compatibility. Recent works on outfit compatibility can be divided into two groups: the first focuses on learning pairwise compatibility, and the second learns outfit compatibility in an end-to-end manner.

Pairwise compatibility can be learned as a metric by mapping item features into a compatibility space and then computing their Euclidean distance (McAuley et al., 2015; Veit et al., 2015). Later research relaxes this metric, since compatibility is not as strict as retrieval. He et al. (He et al., 2016a) mix multiple metrics with different confidences to achieve more robust performance. Shih et al. (Shih et al., 2017) propose to produce multiple projection points for a query image. Vasileva et al. (Vasileva et al., 2018) learn different metric spaces for different type combinations. When evaluating outfit compatibility, these methods take the average of all pairwise compatibilities as the output and neglect the relationship between the pairwise compatibilities and the overall compatibility.

For end-to-end training on outfits, pooling (Li et al., 2017) or concatenation (Tangseng et al., 2018) can be used to aggregate multiple item features, and a multilayer perceptron (MLP) can then compute a compatibility score. However, the addition operation in an MLP is not an efficient way to learn interactions between inputs (Qu et al., 2017). Han et al. (Han et al., 2017) propose to use a BiLSTM to model the dependency between fashion items. Building on that, Nakamura et al. (Nakamura and Goto, 2018) build an autoencoder upon the BiLSTM to learn the style embedding of outfits, an idea similar to controlling the style of image captions (Gan et al., 2017). However, an outfit is more like a set than a sequence because its items are unordered.

3. Approach

Our approach aims to diagnose an outfit in different aspects and is implemented as an end-to-end framework called Multi-Layered Comparison Network (MCN). MCN has four parts. It starts with the multi-layered feature extractor (Section 3.3), which extracts features in different aspects. The comparison modules (Section 3.2) then compute the enumerated pairwise similarities between features at multiple layers, and the MLP predictor computes a score from all of them. Visual semantic embedding (Section 3.4) is used to handle multimodal information. The workflow of MCN is to first predict the outfit compatibility and then backpropagate the gradient from the output score to the input similarities. Related works (Simonyan et al., 2013; Springenberg et al., 2014) visualize the response of each pixel to different classes to interpret CNN classification, whereas our goal is to visualize the response of each similarity to the compatibility score.

In the rest of this section, we first describe the diagnosis process (Section 3.1) and then introduce each part of the framework. The overall scheme of MCN is presented in Figure 2.

3.1. Outfit Diagnosis by Gradients

Outfit compatibility can be considered a summary formed after comparing each item with the others in different aspects, e.g., color, texture and style. To learn the relationship between the holistic compatibility and the pairwise similarities, a simple linear model could be used; it is highly interpretable because the weight of each input dimension indicates its importance for the output. The drawback of the linear model is its limited capacity. In contrast, a multilayer perceptron (MLP) has greater capacity, but its hidden units are hard to explain. To leverage the advantage of the MLP while maintaining interpretability for diagnosis, we use the backpropagation gradients to approximate the importance of each similarity with respect to the incompatibility.

Given an outfit containing $N$ items, their features in a specific aspect, e.g. the color, can be denoted as a set $X=\{x_{1},x_{2},\dots,x_{N}\}$ where $x_{i}$ is the vector for the $i$ -th item. The enumerated pairwise similarities among the features can be represented in matrix form:

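$$R=\begin{bmatrix}r_{11}&r_{12}&\dots&r_{1N}\\ r_{21}&r_{22}&\dots&r_{2N}\\ \vdots&\vdots&\ddots&\vdots\\ r_{N1}&r_{N2}&\dots&r_{NN}\end{bmatrix} \tag{1}$$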

Here, $r_{ij}$ is the similarity between $x_{i}$ and $x_{j}$ , which is an undirected relationship, formally $r_{ij}=r_{ji}$ . We call $R$ the comparison matrix. For comparisons in $K$ different aspects, there are $K$ different comparison matrices $\{R^{1},R^{2},\dots,R^{K}\}$ . All elements in these comparison matrices will be fed into a 2-layer MLP to compute the score of outfit compatibility:

$$s=W_{2}\,\mathrm{ReLU}\left(W_{1}[R^{1};R^{2};\dots;R^{K}]+b\right) \tag{2}$$

Here, $[\cdot;\cdot]$ denotes concatenating multiple matrices and flattening them into a vector. The non-linear relationship between $s$ and the matrices $R^{k}$ helps prediction performance but makes it hard to read off the importance of each input for the compatibility score, as one could with a simple linear model. To solve this, we use the derivative of $s$ to approximate the importance of each input similarity. One interpretation of this operation is that, if we compute the first-order Taylor expansion of the MLP model:

$$s\approx W^{\dagger}[R^{1};R^{2};\dots;R^{K}]+b \tag{3}$$

the approximated equation is a linear model and the weight $W^{\dagger}$ can be easily interpreted as the importance of each input similarity. Each element of $W^{\dagger}$ is the derivative of $s$ with respect to one input similarity, evaluated at the point $[R_{0}^{1};R_{0}^{2};\dots;R_{0}^{K}]$ :

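$$w_{ij}^{k}=\frac{\partial s}{\partial r_{ij}^{k}}\Bigr|_{[R_{0}^{1};R_{0}^{2};\dots;R_{0}^{K}]} \tag{4}$$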

Suppose incompatible outfits are labeled as $1$ ; then $w_{ij}^{k}$ can be interpreted as the importance of each similarity with respect to the incompatibility (if incompatible outfits are labeled as $0$ , simply negate $w_{ij}^{k}$ ). The importance of each item can be computed by summing the gradients of all similarities that involve it:

$$\omega_{q}=\sum_{1\leq k\leq K}\;\sum_{i=q,\,j\neq q}w_{ij}^{k} \tag{5}$$

where $\omega_{q}$ is the diagnosed importance of the $q$ -th item. By substituting the most problematic item, we can revise the outfit to be a more compatible one with very little change to the original composition. During the training, we use the sigmoid function $\sigma(s)=\frac{1}{1+\exp(-s)}$ to model the output score as the compatibility probability and use the binary cross entropy as the loss function:

$$\mathcal{L}_{clf}=-\left[y\cdot\log\sigma(s)+(1-y)\cdot\log(1-\sigma(s))\right] \tag{6}$$
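The diagnosis procedure of this section can be sketched in a few lines of plain Python. This is a minimal illustration and not the released implementation: the function names, the toy weights and the similarity values are all made up, only one aspect ($K=1$) is used, and each unordered item pair contributes one similarity. For a ReLU MLP the gradient of $s$ with respect to the inputs has a closed form, so no autograd library is needed here.

```python
def mlp_score(r, W1, b, W2):
    """Return s = W2 . ReLU(W1 r + b) and the hidden activations."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, r)) + bias)
              for row, bias in zip(W1, b)]
    score = sum(w * h for w, h in zip(W2, hidden))
    return score, hidden


def similarity_gradients(r, W1, b, W2):
    """ds/dr at the point r; only hidden units with positive output contribute."""
    _, hidden = mlp_score(r, W1, b, W2)
    return [sum(W2[h] * W1[h][d] for h in range(len(W1)) if hidden[h] > 0.0)
            for d in range(len(r))]


def item_importance(grads, pairs, num_items):
    """Sum the gradients of every similarity that involves each item."""
    omega = [0.0] * num_items
    for g, (i, j) in zip(grads, pairs):
        omega[i] += g
        omega[j] += g
    return omega


# Toy outfit: 3 items, one aspect (K = 1), one similarity per unordered pair.
# All numbers below are hypothetical and chosen only for illustration.
pairs = [(0, 1), (0, 2), (1, 2)]
r = [0.9, 0.2, 0.1]                          # r_01, r_02, r_12
W1 = [[1.0, -2.0, -1.0], [-1.0, 0.5, 0.0]]   # 2 hidden units
b = [0.0, 0.0]
W2 = [1.0, 1.0]

score, _ = mlp_score(r, W1, b, W2)
grads = similarity_gradients(r, W1, b, W2)
omega = item_importance(grads, pairs, num_items=3)
print(score, grads, omega)
```

In this toy case only the first hidden unit is active, so the gradient equals the first row of `W1` scaled by `W2[0]`, and the similarity with the largest gradient is the one between items 0 and 1.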

3.2. Comparison with Projected Embedding

An important problem is how to compute each pairwise similarity $r_{ij}$ in the comparison matrix. The simplest way is to use cosine similarity to measure the distance between features in a common space. However, a common space has some undesired consequences: (1) The variation of compatibility is compressed. For example, all trousers compatible with a shirt are forced to be close to each other, which implies that these trousers are compatible with each other even if they are not. (2) The triangle inequality limits the position of each embedding. It states that, for any three items, the sum of any two pairwise embedding distances must be greater than or equal to the remaining pairwise embedding distance. This leads to an improper situation: if a shirt matches a pair of trousers and the trousers match a pair of shoes, the shoes are forced to match the shirt. Therefore, following (Chen et al., 2018; Vasileva et al., 2018), we use a projected embedding conditioned on the fashion type combination to address these problems.

A fashion outfit often contains items of different types, such as top, bottom and shoe. The pairwise type combinations can be used as conditions for projecting the embeddings into different subspaces. Let $r_{ij}=f(x_{i},x_{j})$ be the similarity between $x_{i}$ and $x_{j}$ . We compute the similarity through a projection process:
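As a worked illustration of consequence (2), write $\delta$ for a distance in a single shared embedding space (the symbol $\delta$, the bound $\varepsilon$ and the item names are introduced here only for illustration). The triangle inequality gives

$$\delta(\text{shirt},\text{shoes})\;\leq\;\delta(\text{shirt},\text{trousers})+\delta(\text{trousers},\text{shoes}).$$

If the shirt matches the trousers and the trousers match the shoes, both terms on the right are small, say at most $\varepsilon$, so $\delta(\text{shirt},\text{shoes})\leq 2\varepsilon$ whether or not the shirt and the shoes actually go together. When each type pair is compared in its own projected subspace, the three distances are no longer measured by one metric, so this bound does not have to hold.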

$$f(x_{i},x_{j})=d\left(P^{i\to(i,j)}x_{i},\,P^{j\to(i,j)}x_{j}\right) \tag{7}$$

Here, $P^{i\to(i,j)}$ is the projection of the $i$ -th item conditioned on the combination $(i,j)$ , and $d(\cdot,\cdot)$ is the cosine similarity. The projection in this work is implemented as:

$$P^{i\to(i,j)}x_{i}=\mathrm{ReLU}\left(x_{i}\otimes m_{(i,j)}\right) \tag{8}$$

where $m_{(i,j)}$ is a learnable mask vector with the same dimension as the features $x_{i}$ , and $\otimes$ denotes the element-wise product. The mask $m_{(i,j)}$ works as an element-wise gating function that selects the relevant elements of the feature vector for each compatibility condition. We add two additional loss terms to regularize the training (Veit et al., 2017): $\mathcal{L}_{mask}$ encourages the masks to be sparse and $\mathcal{L}_{emb}$ encourages the CNN to encode normalized representations in the latent space:

$$\mathcal{L}_{mask}(m)=\|m\|_{1},\qquad\mathcal{L}_{emb}(x)=\|x\|_{2} \tag{9}$$
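A minimal sketch of the masked comparison described above, in plain Python. Keying the mask dictionary by the sorted pair of type names is an assumption made here so that $r_{ij}=r_{ji}$ holds by construction; the feature vectors and mask values are hypothetical, and in the actual model the masks are learned.

```python
import math


def project(x, mask):
    """ReLU(x ⊗ m): gate each feature element, then clamp negatives to zero."""
    return [max(0.0, xi * mi) for xi, mi in zip(x, mask)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_similarity(x_i, x_j, type_i, type_j, masks):
    """Similarity of two items in the subspace selected by their type pair."""
    mask = masks[tuple(sorted((type_i, type_j)))]
    return cosine(project(x_i, mask), project(x_j, mask))


def l1_norm(m):
    """Sparsity penalty on a mask."""
    return sum(abs(v) for v in m)


def l2_norm(x):
    """Norm penalty on an embedding."""
    return math.sqrt(sum(v * v for v in x))


# Hypothetical 3-dimensional features and hand-written masks.
masks = {("bottom", "top"): [1.0, 0.0, 1.0],
         ("shoe", "top"): [0.0, 1.0, 1.0]}
x_top, x_bottom, x_shoe = [1.0, 2.0, 0.0], [2.0, -5.0, 0.0], [4.0, 0.0, 3.0]

sim_top_bottom = pair_similarity(x_top, x_bottom, "top", "bottom", masks)
sim_top_shoe = pair_similarity(x_top, x_shoe, "top", "shoe", masks)
print(sim_top_bottom, sim_top_shoe)
```

The same top embedding is compared on different dimensions depending on the partner's type, which is what lets one item be close to a bottom in one subspace and far from a shoe in another.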

3.3. Multi-Layered Representation

Representing fashion items in different aspects not only enables the model to diagnose the outfit in detail but also provides rich information for compatibility prediction. One way to construct different representations is to predefine several concepts such as color, texture and shape. However, such definitions may not completely cover fashion descriptions, and predefinition is not the only way to obtain representations in different aspects. Modern deep learning uses CNNs that stack many convolutional layers. Even though the convolutional kernels in each layer are small, the receptive field of the model grows as the architecture goes deeper. Therefore, the earlier layers of a CNN capture low-level features such as color and texture, while the later layers capture high-level features such as fashion style and compatibility. This characteristic of CNNs can be used to build different representations of fashion items from low level to high level.

Leveraging this fact, we construct comparison modules at different layers of the backend CNN model. In detail, we use Global Average Pooling (GAP) to reduce the feature maps at different intermediate CNN layers to vectors (Lin et al., 2013). GAP has two benefits here. First, GAP transforms feature maps into vectors, as required by the comparison operation in Equation 7. Second, fashion features such as color and texture are irrelevant to location, and GAP effectively discards the spatial information. The feature vectors of the $K$ layers are fed into $K$ different comparison modules to compute the enumerated pairwise similarities $\{R^{1},R^{2},\dots,R^{K}\}$ , which are the input for computing the overall compatibility in Equation 2. In the experiments, we use ResNet-50 (He et al., 2016b) as the backend network and choose 4 layers for the multi-layered representation: “conv2_x”, “conv3_x”, “conv4_x” and “conv5_x”.
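The pooling and per-layer comparison can be illustrated with nested lists standing in for CNN feature maps. This is a simplified sketch: the feature maps are tiny hand-written arrays rather than ResNet-50 activations, and plain cosine similarity is used in place of the type-conditioned projection of Section 3.2.

```python
import math


def global_average_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to a length-C vector."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def comparison_matrix(vectors):
    """N x N matrix of pairwise similarities for one layer."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Two items, two layers (K = 2). layer_maps[k][i] is the map of item i at layer k.
layer_maps = [
    [  # shallow layer: 2 channels, 2 x 2 spatial grid
        [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 4.0]]],  # item 0
        [[[7.0, 5.0], [3.0, 1.0]], [[4.0, 0.0], [0.0, 0.0]]],  # item 1: same values, moved
    ],
    [  # deeper layer: 3 channels, 1 x 1 spatial grid
        [[[1.0]], [[0.0]], [[0.0]]],
        [[[0.0]], [[1.0]], [[0.0]]],
    ],
]

matrices = [comparison_matrix([global_average_pool(fm) for fm in items])
            for items in layer_maps]
print(matrices)
```

Because GAP discards location, the two items look identical at the shallow layer even though their activations sit in different spatial positions, while the deeper layer separates them.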

3.4. Visual Semantic Embedding

In real-life situations, fashion items are often described by multimodal information such as images and tags. Visual Semantic Embedding (VSE) (Kiros et al., 2014) takes advantage of information from different modalities by providing a common representation for them (Han et al., 2017; Nakamura and Goto, 2018).

For a fashion item in the Polyvore dataset, the associated text description, e.g. “classic skinny jeans”, can be denoted as $S=\{w_{1},w_{2},\cdots,w_{n}\}$ , where $w_{i}$ is the $i$ -th word and is represented as a one-hot vector $e_{i}$ . The word embedding of $e_{i}$ is $v_{i}=W_{T}e_{i}$ and the semantic embedding of the item is $v=\frac{1}{n}\sum_{i=1}^{n}v_{i}$ , where $W_{T}$ is the weight of the word embedding model. Similarly, the visual features $x$ are projected into the embedding space as $u=W_{I}x$ .

The objective of VSE is to make $u$ and $v$ close to each other in the joint space when they come from the same item. This is achieved by minimizing the following contrastive loss:

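$$\mathcal{L}_{vse}(v,u;W_{T},W_{I})=\sum_{u}\sum_{k}\max\left(0,\,m-d(u,v)+d(u,v_{k})\right)+\sum_{v}\sum_{k}\max\left(0,\,m-d(v,u)+d(v,u_{k})\right) \tag{10}$$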

where $d(u,v)$ is the function to compute the distance between two embeddings. For a matching pair $v$ and $u$ (from the same item), $v_{k}$ denotes the semantic embeddings of all possible non-matching items and $u_{k}$ denotes the visual embeddings of all possible non-matching items. This loss function encourages every matching pair $u$ and $v$ to be closer than all non-matching pairs by a margin $m$ . In practice, the mini-batch can be used as the set over which all $u_{k}$ and $v_{k}$ are enumerated.
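The mini-batch form of this loss can be sketched as follows. Two things here are assumptions rather than facts from the text: $d$ is taken to be cosine similarity (as in Section 3.2), so that a hinge of the form $\max(0, m - d(\text{match}) + d(\text{non-match}))$ vanishes once the matching pair is more similar by at least the margin, and the margin value `0.2` is a placeholder.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vse_loss(U, V, margin=0.2):
    """Bidirectional hinge loss over a batch.

    U[i] and V[i] are the visual and semantic embeddings of the same item;
    every other index in the batch serves as a non-matching example.
    """
    loss = 0.0
    for i in range(len(U)):
        match = cosine(U[i], V[i])
        for k in range(len(U)):
            if k == i:
                continue
            # image i against the text of a non-matching item k
            loss += max(0.0, margin - match + cosine(U[i], V[k]))
            # text i against the image of a non-matching item k
            loss += max(0.0, margin - match + cosine(V[i], U[k]))
    return loss


aligned = vse_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = vse_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

With perfectly aligned embeddings every hinge term is inactive and the loss is zero; swapping the text embeddings makes every term pay the margin plus the full similarity gap.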

Finally, our framework can be trained jointly in an end-to-end manner with a total loss $\mathcal{L}_{total}$ , where $\lambda_{\{1,2,3\}}$ are the weights of the additional losses. The trainable parameters include $\{\Theta$ , $W_{T}$ , $W_{I}$ , $W_{2}$ , $W_{1}$ , $b$ , $M\}$ , where $\Theta$ denotes the parameters of the backend CNN model, $W_{2},W_{1},b$ are the parameters of the MLP in Equation 2, and $M$ denotes the masks for different conditions in Equation 8.

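$$\mathcal{L}_{total}=\mathcal{L}_{clf}+\lambda_{1}\mathcal{L}_{emb}+\lambda_{2}\mathcal{L}_{mask}+\lambda_{3}\mathcal{L}_{vse} \tag{11}$$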

4. Experiment

To overcome the subjective nature of fashion sense, a dataset of compatible outfits can be constructed by people with a good sense of fashion, while randomly combined outfits are very likely to be incompatible. In this section, we first describe how we build the Polyvore-T dataset. Then we compare the prediction performance of our MCN framework with several baselines. Finally, we show the results of automatically diagnosing and revising fashion outfits. Code and dataset are released at https://github.com/WangXin93/fashion_compatibility_mcn.

4.1. Polyvore-T Dataset

We build a type-labeled fashion outfit dataset on top of the Polyvore dataset (Han et al., 2017). The original Polyvore dataset contains 21889 expert-selected outfits. Graph segmentation is used to split it into training, validation and test sets. Each item has a corresponding image, text description, votes, and category (such as jeans, skirts or sports; 381 categories in total). However, some of these categories overlap, such as “shoulder bags” and “bags”, and some categories do not have sufficient samples for training. We therefore label each item with type information by grouping the 381 categories into 5 types as follows: (1) Unrelated categories such as “lipsticks” and “mirrors” are filtered out, leaving 158 categories. (2) The remaining categories are classified by hand into 5 types (top, bottom, shoe, bag, accessory). (3) Items whose category is not in the retained list are dropped, and the remaining items are labeled according to their categories. When an outfit contains multiple items of the same type, only the first one is kept. We remove outfits with fewer than 3 types, so each outfit in our dataset has at least 3 and at most 5 items. The statistics of the filtered dataset are shown in Table 1. Type information enables the comparisons to be conditioned on the type combination, as discussed in Section 3.2. It also prevents repeated items, such as two pairs of trousers, in one outfit.
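The cleaning rules above can be sketched as a small filter. The function name, the item identifiers and the category-to-type dictionary are hypothetical; the dictionary below uses a few example categories from Table 1(b), whereas the real mapping covers the 158 retained categories.

```python
def clean_outfit(items, category_to_type, min_types=3):
    """Apply the cleaning rules to one outfit.

    items: list of (item_id, category) in the outfit's original order.
    Returns a dict {type: item_id}, or None if the outfit is dropped.
    """
    chosen = {}
    for item_id, category in items:
        item_type = category_to_type.get(category)
        if item_type is None:          # category was filtered out
            continue
        if item_type not in chosen:    # keep only the first item of each type
            chosen[item_type] = item_id
    if len(chosen) < min_types:        # fewer than 3 types: drop the outfit
        return None
    return chosen


# Hypothetical excerpt of the hand-built mapping.
category_to_type = {
    "Jackets": "top", "Sweaters": "top",
    "Skinny Jeans": "bottom", "Shorts": "bottom",
    "Pumps": "shoe", "Tote Bags": "bag", "Earrings": "accessory",
}

kept = clean_outfit(
    [("a", "Jackets"), ("b", "lipsticks"), ("c", "Sweaters"),
     ("d", "Skinny Jeans"), ("e", "Pumps")],
    category_to_type,
)
dropped = clean_outfit([("f", "Jackets"), ("g", "mirrors"), ("h", "Shorts")],
                       category_to_type)
print(kept, dropped)
```

The first outfit keeps one top, one bottom and one shoe; the second has only two valid types after filtering and is removed.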

4.2. Experiment Settings

Training. In all experiments, we use a ResNet-50 (He et al., 2016b) model pre-trained on ImageNet (Russakovsky et al., 2015) as the backbone unless otherwise specified. The spatial size of the input images is $224\times 224$ . An input outfit has a variable length of 3 to 5 items; missing types are padded with the mean image of that type. A batch normalization layer (Ioffe and Szegedy, 2015) is added after the comparison modules to normalize the similarities of the same type combination. All models are trained on a single Tesla P40 GPU with mini-batches of 32 outfits; training for 50 epochs takes about 9 hours. The initial learning rate is $1\times 10^{-2}$ and decays by a factor of 0.2 every 10 epochs. The optimizer is SGD with a momentum of 0.9. We set the weights of the additional losses $\lambda_{\{1,2,3\}}$ to $5\times 10^{-3}$ , $5\times 10^{-4}$ and 1, respectively. Only the model parameters with the best performance on the validation set are saved.
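The step schedule described above can be written as a one-line function. Counting epochs from zero, so that the first decay takes effect at epoch 10, is an assumption; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=1e-2, decay=0.2, step=10):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Rates at the start of each 10-epoch block over the 50-epoch run.
schedule = [learning_rate(e) for e in range(0, 50, 10)]
print(schedule)
```

Under this reading, the final ten epochs run at the base rate times $0.2^{4}$, roughly $1.6\times 10^{-5}$.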

Negative Sample. During training, the positive samples (compatible outfits) come from the ground-truth dataset. Negative samples are generated by substituting each item in a positive outfit with a random item of the same type from the dataset. This is because, in real life, fashion designers focus on composing compatible outfits and seldom think about how to build an incompatible one. The subjectivity of incompatibility also makes manual labeling unreliable. Since fashion experts in communities like Polyvore compose outfits according to various aesthetic rules, randomly composed outfits are very likely to be incompatible.
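A minimal sketch of this negative-sampling step. The data layout (an outfit as a dict from type to item id, and a pool of item ids per type) is hypothetical, and excluding the original item from the candidates is an extra assumption not stated in the text.

```python
import random


def make_negative(outfit, items_by_type, rng):
    """Replace every item of a positive outfit with a random same-type item."""
    negative = {}
    for item_type, item_id in outfit.items():
        # Assumption: never draw the original item again.
        candidates = [c for c in items_by_type[item_type] if c != item_id]
        negative[item_type] = rng.choice(candidates) if candidates else item_id
    return negative


items_by_type = {
    "top": ["t1", "t2", "t3"],
    "bottom": ["b1", "b2"],
    "shoe": ["s1", "s2", "s3"],
}
positive = {"top": "t1", "bottom": "b1", "shoe": "s1"}
negative = make_negative(positive, items_by_type, random.Random(0))
print(negative)
```

The negative outfit keeps the same type slots as the positive one, so the type-conditioned comparisons see the same type combinations for both classes.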

4.3. Tasks

The performance of outfit compatibility prediction is evaluated on two tasks: outfit compatibility prediction AUC and fill-in-the-blank accuracy. Figure 3 illustrates the two tasks. At test time, each model is evaluated 5 times with different dynamically generated test sets.

**Outfit Compatibility Prediction (AUC):** The goal of the outfit compatibility prediction task is to compute a score representing the overall compatibility. MCN predicts the score from multiple input images in an end-to-end way. We generate the test set following the negative sampling procedure in Section 4.2 with a positive-to-negative ratio of $1:1$. The area under the ROC curve (AUC) is used to compare different methods.
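For reference, the AUC equals the probability that a randomly chosen positive outfit receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from this definition; it is a quadratic-time illustration rather than the evaluation code used in the experiments, and the function name is ours.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, with positive scores 0.9 and 0.3 and negative scores 0.5 and 0.1, three of the four pairs are ranked correctly, giving an AUC of 0.75.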

**Fill-in-the-blank (FITB):** The fill-in-the-blank task aims to select the item that is most compatible with the remainder of the outfit, which serves as the question. It is evaluated by the accuracy of the answers. Each question has 4 options in our experiment. MCN and the other baselines perform this task by scoring the 4 outfits obtained by filling the blank with each option, then choosing the option with the highest score as the answer.
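The FITB protocol can be sketched as follows. Here `score_fn` stands for any outfit compatibility scorer (MCN or a baseline); the question layout and the function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

Outfit = Dict[str, str]
# (partial outfit, type of the blank, candidate items, index of the answer)
Question = Tuple[Outfit, str, Sequence[str], int]


def answer_fitb(partial: Outfit, blank_type: str, options: Sequence[str],
                score_fn: Callable[[Outfit], float]) -> int:
    """Fill the blank with each option and return the best-scoring index."""
    scores = [score_fn({**partial, blank_type: option}) for option in options]
    return scores.index(max(scores))


def fitb_accuracy(questions: Sequence[Question],
                  score_fn: Callable[[Outfit], float]) -> float:
    """Fraction of questions whose top-scoring option is the ground truth."""
    correct = sum(answer_fitb(partial, blank, options, score_fn) == answer
                  for partial, blank, options, answer in questions)
    return correct / len(questions)
```

With 4 options per question, a scorer that ignores its input would reach an accuracy of about 0.25 on average.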

4.4. Baselines

We compare our method with several recent fashion outfit compatibility works. Their brief descriptions are as follows:

Pooling (Li et al., 2017): It uses an average-pooling operation to aggregate a variable number of item features to predict the compatibility.

Concatenation (Tangseng et al., 2018): It concatenates 5 item features into a long vector of length $1000\times 5$, then uses an MLP as the binary classifier. We set the size of the hidden layer to 1000.

CSN (Vasileva et al., 2018): It is a metric learning method for pairwise compatibility. The compatibility is computed as the distance between projected embeddings conditioned on the type combination. The outfit compatibility is the average of all pairwise compatibilities.

BiLSTM+VSE (Han et al., 2017): At each time step, the LSTM consumes one CNN-encoded feature and outputs a hidden state and a prediction of the next item. The cross entropy between the predicted items and the ground truth is used as the compatibility score. It jointly optimizes the forward LSTM loss, the backward LSTM loss, and the VSE loss.

Self-Attention (Wang et al., 2018): It uses the self-attention mechanism to relate different items in an outfit to compute a representation of the outfit. We use the scaled dot-product attention (Vaswani et al., 2017) where the query, key, and value are the item features in the same outfit.
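To make the self-attention baseline concrete, the following is a plain-Python sketch of scaled dot-product attention, $\mathrm{softmax}(QK^{\top}/\sqrt{d})V$, with the query, key, and value all set to the item features. It omits any learned projections and does not show how the attended features are pooled into an outfit representation, since neither detail is specified above; the function name is ours.

```python
import math
from typing import List


def self_attention(features: List[List[float]]) -> List[List[float]]:
    """Scaled dot-product attention with Q = K = V = features."""
    dim = len(features[0])
    attended = []
    for query in features:
        # Scaled dot product between this item and every item in the outfit.
        logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        # Numerically stable softmax over the items.
        peak = max(logits)
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        attended.append([sum(w * value[j] for w, value in zip(weights, features))
                         for j in range(dim)])
    return attended
```

The attention weights depend on the dot products between item features, and the output is a weighted mixture of the features themselves rather than a set of explicit pairwise similarities.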

4.5. Prediction Performance

The prediction performance of MCN. The results of outfit compatibility prediction AUC and FITB accuracy are shown in Table 2. It can be observed that there is a clear margin between MCN and the other baselines. FITB is a harder task because substituting only one item may have little effect on the global estimation. BiLSTM does not perform well when there is no repeated item of the same type in the outfit; this result is similar to the experiments in (Vasileva et al., 2018). Self-Attention computes the relationships between item features as attention weights, but it does not perform well on the compatibility task because compatibility depends more on feature similarities than on feature contents. The pooling method performs better than several more complex methods because it involves an interaction among all item features. MCN follows the same spirit, learning the overall compatibility from all pairwise feature similarities.

Ablation Study. To analyze the efficiency of each part of the MCN framework, we conduct an ablation study. The results are shown in Table 3, where CM stands for the comparison module, VSE for the Visual Semantic Embedding and PE for the Projected Embedding. It can be observed that concatenating item features is the least efficient method. The comparison module without projected embeddings already achieves competitive performance, and the comparison operation uses fewer parameters because it removes the MLP over long feature vectors. VSE only regularizes training, so it adds no parameters at evaluation time; the number of PE parameters is of the same order of magnitude as the feature vectors.

What is the relationship between the overall compatibility and pairwise similarities? As illustrated in Table 4, end-to-end training with the comparison module and the metric learning method CSN perform similarly. However, CSN has already sampled from the training dataset, so it cannot use the same dataset to learn the relationship between the overall compatibility and the pairwise similarities. In contrast, the depth of the MLP predictor in MCN can be varied to study this question. We explore 0 to 3 layers and find that the performance plateaus at a 2-layer MLP, which indicates that the relationship between them is non-linear.

Multi-layered representations benefit the performance. Figure 4 shows an example of retrieval with representations from different layers. It can be seen that the results retrieved with low-level features, such as those of layer 1, are similar in color and texture. The results retrieved with high-level features, such as those of layer 4, are similar in visual style and compatibility. This observation indicates that the multi-layered representation in MCN ranges from the low level to the high level. Table 5 analyzes the effect of using input from different layers. It can be seen that each layer contributes to the performance, which supports the claim that low-level features have an important effect on fashion compatibility.

4.6. Outfit Compatibility Diagnosis

Diagnosis from Different Layers. The diagnosis process computes the importance of each pairwise similarity and each item with respect to the incompatibility, as described in Section 3.1. We show several diagnosis results in Figure 5. All diagnosis values are normalized so that the margin between the maximum and the minimum is 1. The edges with the 3 greatest values are marked as red dashed lines. It can be observed that the diagnosis results are consistent with human judgment. For instance, in Figure 5(c) the green dress is pointed out mainly because its comparison with the black sweater is incompatible.
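The normalization and edge selection described above can be sketched as follows. We assume the raw diagnosis values are already available as one number per item pair (for example, the gradient-based importance from Section 3.1), and that normalization divides by the range without shifting; min-max scaling would satisfy the same description. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def normalize_diagnosis(values: Dict[Pair, float]) -> Dict[Pair, float]:
    """Rescale values so that max - min equals 1 (assumes division by range)."""
    span = max(values.values()) - min(values.values())
    if span == 0:
        return {pair: 0.0 for pair in values}  # all pairs equally important
    return {pair: value / span for pair, value in values.items()}


def top_edges(values: Dict[Pair, float], k: int = 3) -> List[Pair]:
    """Return the k item pairs with the greatest diagnosis values."""
    return sorted(values, key=values.get, reverse=True)[:k]


# Hypothetical raw diagnosis values for four item pairs.
raw = {
    ("top", "bottom"): 2.0,
    ("top", "shoes"): -2.0,
    ("bottom", "shoes"): 0.0,
    ("top", "bag"): 1.0,
}
normalized = normalize_diagnosis(raw)
flagged = top_edges(normalized)
```

Dividing by the range keeps the sign of each value, which matters if the sign of the gradient is used to tell harmful similarities from helpful ones.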

Automatically Revise the Outfit. One benefit of outfit diagnosis is that we can revise the outfit based on the diagnosis result, which lets us make only small changes to the original composition. We try a simple strategy that substitutes the problematic items in the outfit, as described in Algorithm 1. Several results are shown in Figure 6. It can be seen that the revised outfits are more aesthetically pleasing while most items remain the same as in the original outfits.
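One possible substitution strategy is sketched below. It is a greedy illustration under our own assumptions and is not a transcription of Algorithm 1: it replaces only the single item with the largest diagnosis value, and `score_fn`, `item_importance` and `candidates_by_type` are hypothetical inputs.

```python
from typing import Callable, Dict, List

Outfit = Dict[str, str]


def revise_outfit(outfit: Outfit,
                  item_importance: Dict[str, float],
                  candidates_by_type: Dict[str, List[str]],
                  score_fn: Callable[[Outfit], float]) -> Outfit:
    """Swap the most problematic item for the candidate that scores best.

    The original outfit is returned when no candidate improves the score.
    """
    worst_type = max(item_importance, key=item_importance.get)
    best, best_score = dict(outfit), score_fn(outfit)
    for candidate in candidates_by_type[worst_type]:
        trial = {**outfit, worst_type: candidate}
        trial_score = score_fn(trial)
        if trial_score > best_score:
            best, best_score = trial, trial_score
    return best
```

Since only one type is changed, all other items of the original outfit are preserved, which matches the goal of making very little change to the composition.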

5. Conclusion

In this paper, we introduce an approach to not only predict but also diagnose fashion outfit compatibility. We introduce an end-to-end framework that incorporates the comparison module, multi-layered representation, and visual semantic embedding. The diagnosis process is implemented by first predicting the compatibility score and then using the backpropagation gradient to approximate the importance of each compared similarity to the incompatibility. Experiments show that the comparison module is an efficient way to aggregate multiple features for learning outfit compatibility. Representations from different layers also boost the performance, which indicates that both low-level and high-level features affect fashion compatibility. We show that our framework can diagnose an outfit by pointing out the problematic similarities and items, which can be used to interpret the prediction and automatically revise the outfit. For future work, we plan to explore how to better explain different aspects of compatibility and to quantitatively evaluate the diagnosis results.

Acknowledgements.

This work is supported by the National Natural Science Foundation of China (Grant No. 61572124). We thank JD AI Research, led by Dr. Tao Mei, and Wei Zhang for his insightful discussions.

Bibliography

Al-Halah et al. (2017) Ziad Al-Halah, Rainer Stiefelhagen, and Kristen Grauman. 2017. Fashion forward: Forecasting visual style in fashion. arXiv preprint arXiv:1705.06394 (2017).
Chang et al. (2017) Yu-Ting Chang, Wen-Huang Cheng, Bo Wu, and Kai-Lung Hua. 2017. Fashion world map: Understanding cities through streetwear fashion. In Proceedings of the 25th ACM international conference on Multimedia. ACM, 91–99.
Chen et al. (2018) Hongxu Chen, Hongzhi Yin, Weiqing Wang, Hao Wang, Quoc Viet Hung Nguyen, and Xue Li. 2018. PME: projected metric embedding on heterogeneous networks for link prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1177–1186.
Gan et al. (2017) Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. 2017. Stylenet: Generating attractive visual captions with styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3137–3146.
Han et al. (2017) Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S. Davis. 2017. Learning Fashion Compatibility with Bidirectional LSTMs. Proceedings of the 2017 ACM on Multimedia Conference - MM ’17 1 (2017), 1078–1086. https://doi.org/10.1145/3123266.3123394 arXiv:1707.05691
He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
He et al. (2016a) Ruining He, Charles Packer, and Julian McAuley. 2016a. Learning compatibility across categories for heterogeneous item recommendation. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 937–942.