Large-Scale Classification of Structured Objects using a CRF with Deep Class Embedding

Eran Goldman; Jacob Goldberger

arXiv:1705.07420·cs.CV·November 19, 2019


TL;DR

This paper introduces a deep learning approach combining CNNs and CRFs with class embeddings to classify large sets of visually similar structured objects, effectively modeling contextual relationships.

Contribution

The novel architecture jointly learns visual features and class embeddings within a CRF framework, addressing challenges of large, sparse datasets with many similar categories.

Findings

1. Significantly outperforms linear CRF models on large-scale datasets.
2. Effectively models contextual relationships among classes.
3. Handles data sparsity with a nonlinear training objective.

Abstract

This paper presents a novel deep learning architecture to classify structured objects in datasets with a large number of visually similar categories. We model sequences of images as linear-chain CRFs, and jointly learn the parameters from both local-visual features and neighboring classes. The visual features are computed by convolutional layers, and the class embeddings are learned by factorizing the CRF pairwise potential matrix. This forms a highly nonlinear objective function which is trained by optimizing a local likelihood approximation with batch-normalization. This model overcomes the difficulties of existing CRF methods to learn the contextual relationships thoroughly when there is a large number of classes and the data is sparse. The performance of the proposed method is illustrated on a huge dataset that contains images of retail-store product displays, taken in varying…

Tables

Table 2: Comparison of the object-level error rate between the different methods and our full approach, the Approximate Factorized CRF with BN. The table shows the mean and standard deviation of the error percentage over the 5 dataset splits.

Method	Training	$\mu_{error}$ (%)	$\sigma_{error}$ (%)
Unary (no context)	CNN	15.61	0.21
BiLSTM	RNN	15.54	0.45
Pairwise Statistics CRF	CV	15.60	0.59
Log-linear CRF	CRF	15.39	0.20
Log-linear CRF	MEMM	14.30	0.09
Factorized CRF	CRF	14.62	0.19
Factorized CRF	MEMM	14.93	0.52
Factorized CRF + BN	MEMM	12.85	0.31

Table 3: Results on the MIT OCR dataset of handwritten words.

Architecture	Acc	AP	$R^{P=0.7}$	$R^{P=0.9}$
Unary	0.74	0.81	0.78	0.59
Log-linear CRF	0.81	0.88	0.87	0.73
Factorized CRF	0.82	0.90	0.89	0.76
Factorized CRF + BN	0.84	0.92	0.91	0.79

Equations

$p(\mathbf{y}\mid\mathbf{x})=\frac{1}{Z}\prod_{t=1}^{n}\varphi(y_{t},x_{t},y_{t-1})$

$\varphi(y_{t},x_{t},y_{t-1})=\exp\left(y_{t-1}^{\top}Py_{t}+h(x_{t})^{\top}Uy_{t}+b^{\top}y_{t}\right).$

$P=R^{\top}Q.$

$\varphi(y_{t},h_{t},y_{t-1})=\exp\left((Ry_{t-1})^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right).$

$L(R,Q,U,b)=\sum_{i=1}^{s}\log p(\mathbf{y}_{i}\mid\mathbf{h}(\mathbf{x}_{i}))$

$p(\mathbf{y}\mid\mathbf{h}(\mathbf{x}))=\prod_{t=1}^{n}p(y_{t}\mid h_{t},y_{t-1})$

$p(y_{t}\mid h_{t},y_{t-1})=\frac{1}{Z(t)}\exp\left(y_{t-1}^{\top}R^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right).$

$L_{\text{MEMM}}=\sum_{i=1}^{s}\sum_{t=1}^{n}\log p(y_{i,t}\mid h_{i,t},y_{i,t-1})$

$\log p(y_{t}\mid h_{t},y_{t-1})=\begin{pmatrix}1 & h_{t}^{\top} & (Ry_{t-1})^{\top}\end{pmatrix}\begin{pmatrix}b^{\top}\\ U\\ Q\end{pmatrix}y_{t}-\log Z(t).$

$p(y_{t}\mid h_{t},y_{t-1})=\frac{1}{Z(t)}\exp\left(\text{BN}(Ry_{t-1})^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right)$

$\mu_{\mathcal{B}}=\frac{1}{m}\sum_{i=1}^{m}x_{i},\qquad\sigma_{\mathcal{B}}^{2}=\frac{1}{m}\sum_{i=1}^{m}(x_{i}-\mu_{\mathcal{B}})^{2}.$

$L_{\text{cnn}}=\sum_{i=1}^{s}\sum_{t=1}^{n}\log p(y_{i,t}\mid x_{i,t})$

$L_{\text{MEMM}}(R,Q,U,b)=\sum_{i=1}^{s}\sum_{t=1}^{n}\log p(y_{i,t}\mid h_{i,t},y_{i,t-1})\quad\text{s.t.}\quad p(y_{i,t}\mid h_{i,t},y_{i,t-1})\propto\exp\left(\text{BN}(Ry_{i,t-1})^{\top}Qy_{i,t}+h_{i,t}^{\top}Uy_{i,t}+b^{\top}y_{i,t}\right)$

$p(\mathbf{y}\mid\mathbf{x})\propto\exp\left(\sum_{t=1}^{n}\left(\text{BN}_{test}(Ry_{t-1})^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right)\right)$

$\text{BN}_{test}(Ry_{i})=\frac{Ry_{i}-\mu_{\mathcal{P}}}{\sqrt{\sigma_{\mathcal{P}}^{2}+\epsilon}}$

$\log p(\mathbf{y}\mid\mathbf{x})=\sum_{t}\log p(y_{t}\mid y_{t-1},y_{t+1},\mathbf{x})$

$\frac{\exp\left(y_{t-1}^{\top}Py_{t}+y_{t}^{\top}Py_{t+1}+x_{t}^{\top}Uy_{t}+y_{t}^{\top}b\right)}{\sum_{a}\exp\left(y_{t-1}^{\top}Pa+a^{\top}Py_{t+1}+x_{t}^{\top}Ua+a^{\top}b\right)}.$

$\log p(\mathbf{y}\mid\mathbf{x})=\sum_{t}\log\frac{\varphi(y_{t},x_{t},y_{t-1})}{\sum_{a,b}\varphi(a,x_{t},b)}.$

$\log p(\mathbf{y}\mid\mathbf{x})=\sum_{t}\log\left(\frac{\varphi(y_{t},x_{t},y_{t-1})}{\sum_{a}\varphi(a,x_{t},y_{t-1})}\cdot\frac{\varphi(y_{t},x_{t},y_{t-1})}{\sum_{a}\varphi(y_{t},x_{t},a)}\right).$

$p(y_{t-1}\mid y_{t})=\frac{\exp\left(y_{t-1}^{\top}Py_{t}\right)}{\sum_{a}\exp\left(a^{\top}Py_{t}\right)}.$

$\log p(\mathbf{y}\mid\mathbf{x})=\sum_{t}\left(\log p(y_{t}\mid y_{t-1},x_{t})+\log p(y_{t-1}\mid y_{t})\right).$

Taxonomy

Methods: Conditional Random Field

Full text

CRF with Deep Class Embedding for Large Scale Classification

Eran Goldman

Jacob Goldberger

Faculty of Engineering, Bar Ilan University, Israel

Trax Image Recognition

Abstract

This paper presents a novel deep learning architecture for classifying structured objects in ultrafine-grained datasets, where classes may not be clearly distinguishable by their appearance but rather by their context. We model sequences of images as linear-chain CRFs, and jointly learn the parameters from both local-visual features and neighboring class information. The visual features are learned by convolutional layers, whereas class-structure information is reparametrized by factorizing the CRF pairwise potential matrix. This forms a context-based semantic similarity space, learned alongside the visual similarities, and dramatically increases the learning capacity of contextual information. This new parametrization, however, forms a highly nonlinear objective function which is challenging to optimize. To overcome this, we develop a novel surrogate likelihood which allows for a local likelihood approximation of the original CRF with integrated batch-normalization. This model overcomes the difficulties of existing CRF methods to learn the contextual relationships thoroughly when there is a large number of classes and the data is sparse. The performance of the proposed method is illustrated on a huge dataset that contains images of retail-store product displays, and shows significantly improved results compared to linear CRF parametrization, unnormalized likelihood optimization, and RNN modeling. We also show improved results on a standard OCR dataset.

Journal: Computer Vision and Image Understanding

1 Introduction

Object recognition is one of the fundamental problems in computer vision. It involves finding and identifying objects in images, and plays an important role in many real-world applications such as advanced driver assistance systems, military target detection, diagnosis with medical images, video surveillance, and identity recognition. Over the past few years deep convolutional neural networks (CNN) have led to remarkable progress in image classification (He et al., 2016; Krizhevsky et al., 2012), and resulted in reliable appearance-based detectors; e.g., (Lin et al., 2018; Liu et al., 2016; Redmon and Farhadi, 2017; Ren et al., 2015; Goldman et al., 2019).

Fine-grained object recognition aims to identify subcategory object classes, which includes finding subtle differences among visually similar subcategories such as dog breeds, product brands, car models, etc. The differences between classes are often small but always visually measurable, making visual recognition challenging but possible. Some of these datasets (e.g. UT-Zap50K (Yu and Grauman, 2014)) provide each class with an in-vitro image: a catalog or studio image isolated and captured under ideal imaging conditions; other datasets (e.g. Caltech-UCSD Birds (Wah et al., 2011), Stanford Dogs (Khosla et al., 2011), FGVC-Aircraft (Maji et al., 2013)) provide each class with several in-situ images, captured in natural real-world environments. Nonetheless, the image quality is mostly satisfactory for the task of visual classification. In fact, recent studies achieved good performance on fine-grained tasks (Lin et al., 2015; Peng et al., 2018; Zhang et al., 2014).

However, the problem remains extremely difficult when the dataset categories are nearly identical in terms of their visual appearance. In this case, the object categories may be virtually indistinguishable, since the discriminant features are often masked by inadequate observation or visual artifacts. Here we present an ultrafine-grained structured classification dataset; unlike other fine-grained classification datasets, our images are in-situ low-resolution cropped patches whose classes are often virtually indistinguishable by visual inspection alone. Therefore, incorporating additional sources of information to the classifier is imperative. Since the object-patches originate from larger scenes, we can model contextual relations between the objects based on their geometric layout. This study tackles the challenge of fine-grained, large-scale structured classification, and describes a novel, state-of-the-art technique for this task.

We address the problem of classifying a sequence of objects based on their visual appearance and their relative locations. Our dataset contains photos of retail store product displays, taken in varying settings and viewpoints. Our task is to identify the class of each product at the front of the shelves. The dataset is uniquely characterized by a distinct geometric object structure made up of sequences of shelves, a large number of classes, and very subtle visual differences between groups of classes, in that some classes differ only in size or minor design details. The unique challenges in this task involve (a) large-scale classification: handling the large number of possible classes, and (b) ultrafine-grained structured classification: the fact that the classes are not clearly distinguishable by their appearance but rather by their context. For example, products with an identical appearance but different container volumes are considered different classes (see examples in Fig. 1).

Because an object’s local appearance may not suffice for accurate categorization, additional information needs to be considered. In real-world images, contextual data provides useful information about the spatial and semantic relationships between objects. Modeling a joint visual-contextual classifier is nontrivial in that some contextual cues are very informative, whereas others are irrelevant or even misleading (Barnea and Ben-Shahar, 2019; Yu et al., 2016). Therefore, most deep learning detectors classify each detected object individually without taking the contextual information into account. Moreover, the handful of existing context-aware methods do not have the learning capacity for complex datasets such as ours, and cannot properly handle large-scale fine-grained structured classification.

Related Work: Context has been used to improve performance for image understanding tasks in various ways (Divvala et al., 2009; Felzenszwalb et al., 2010; Torralba, 2003). Graphical models have been widely applied to visual and auditory analysis tasks by jointly modeling local features and contextual relations. The tasks addressed by these models include image segmentation and object recognition (Chandra et al., 2017; Chen et al., 2018; Gould et al., 2009; Rabinovich et al., 2007; Wang et al., 2015; Yao et al., 2012; Zheng et al., 2015), as well as speech (Wang and Wang, 2012), music (Korzeniowski and Widmer, 2016), text (Chen et al., 2016) and video analysis (Hu et al., 2014).

Few studies have applied deep learning features or detection results to context models: Chen et al. (2015) explored several techniques to learn structured models jointly with deep features that form MRF potentials. Chu and Cai (2018) evaluated the performance of a joint CRF model on Faster R-CNN (Ren et al., 2015) detection results using an a-priori statistical summary for the pairwise potentials. Korzeniowski and Widmer (2016) introduced a two-stage learning model for musical chord recognition: one network learns a single-frame representation, and the other learns the potentials of a linear-chain CRF model using the frame-representations as the CRF input. These models use the vanilla CRF parametrization, which includes pairwise potentials to represent object-pair interactions and allocates a different parameter to each class pair. This approach, which ignores class similarities, is only sufficient for small sets of distinct classes. Indeed, these models have been tested only on OCR datasets, which contain 26 classes (Chen et al., 2015), a chord-recognition dataset with 25 classes (Korzeniowski and Widmer, 2016), and PASCAL VOC 2007 with 20 classes (Chu and Cai, 2018). This formulation is not sufficient for a large class-set that contains visually similar classes. Our dataset, which includes many visually similar categories, nearly a thousand classes and nearly a million possible pairwise transitions overall, requires a more advanced learning mechanism. Furthermore, whereas in most previous object recognition studies the visual information was dominant, in our task, context information also makes a significant contribution.

In this study we present a Conditional Random Field (CRF) based method that explicitly learns the embedding of classes with respect to their neighbor’s class and appearance. This is achieved by factorizing the CRF pairwise potential matrix to impose the structure of class embedding in a low-dimensional space. Our model learns the factorized parameters, and yields a joint contextual-visual embedding of the classes. The factorization drastically increases the learning capacity of contextual information, but also forms a multi-modal likelihood function which is more challenging to optimize. To overcome this, we develop a local surrogate likelihood and apply the proper regularization required for convergence. To train the network, we introduce a pairwise softmax architecture that optimizes a local approximation of the likelihood. Since the global factorized loss function is not convex, we favor optimizing the approximate surrogate likelihood, which allows us to include batch-norm related regularization for the object samples, and achieve dramatic improvement not only in training time and model simplicity but also in terms of the overall performance of the trained model. At test time, dynamic programming techniques are used for efficient exact inference of the classes. The contribution of this work is twofold:

1. Combining deep class embedding into a CRF formulation that enables handling datasets with a huge number of classes.

2. An approximated-likelihood training procedure that is both computationally efficient and, unlike exact CRF likelihood, enables us to incorporate batch-normalization into the training procedure.

We validate our method on a large image dataset and on an OCR dataset. Direct comparison of our method to the most relevant previous work shows superior performance. The rest of the paper is organized as follows. In Section 2 we describe a CRF model with a class embedding formulation and present the learning and inference algorithms. Section 3 contains a detailed data description and comparative experimental results. Object embedding analysis is described in Section 4 and the conclusions are given in Section 5.

2 CRF With Normalized Class Embedding

2.1 Model Formulation

We are given a sequence of observations $\mathbf{x}=(x_{1},...,x_{n})$ . The data can be a sequence of image patches which correspond to a horizontal layout of objects (see example in Fig. 2). The goal is to classify each object in the sequence into one of $k$ predefined categories, where $k$ is a large number. A standard CNN can classify the object in each image patch individually, implicitly assuming independence between elements in the sequence. In order to include context in the classification process, we model the sequence as a CRF.

We first use a local CNN to obtain a non-linear representation of the input image. Similar to the concept of transfer-learning, we can discard the CNN softmax layer, and use the convolutional layers to compute the feature-vectors of the input images. For each image-patch $x_{t}$ we define the feature vector $h_{t}=h(x_{t})$ as the activations of the last hidden fully-connected or global-average-pooling (GAP) layer (Bengio et al., 2013; Lin et al., 2013), and use it as the CRF input observation feature vector.

Linear-chain Conditional Random Field (LC-CRF) (Lafferty et al., 2001) is a type of discriminative undirected probabilistic graphical model, whose conditional distribution $p(\mathbf{y}|\mathbf{x})$ obeys a conditional Markov property. The joint probability distribution of a linear-chain CRF is:

$p(\mathbf{y}\mid\mathbf{x})=\frac{1}{Z}\prod_{t=1}^{n}\varphi(y_{t},x_{t},y_{t-1})\qquad(1)$

where $\mathbf{x}=(x_{1},...,x_{n})$ is the input sequence, $\mathbf{y}=(y_{1},...,y_{n})$ is the corresponding sequence of the target labels, $\varphi$ is the model’s potential function and $Z$ is the partition function defined as the global probability normalization over all possible sequence label-assignments of length $n$ . We further assume that the potential function is defined as a simple log-linear function of the model parameters:

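$\varphi(y_{t},x_{t},y_{t-1})=\exp\left(y_{t-1}^{\top}Py_{t}+h(x_{t})^{\top}Uy_{t}+b^{\top}y_{t}\right).\qquad(2)$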

The CRF model parameters are $P$ , $U$ and $b$ where $P$ is a $k\times k$ pairwise potential matrix that models the relation between consecutive labels, $U$ is a unary potential matrix and the vector $b$ is the label bias. Note that we use a one-hot encoding for the labels.
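As a minimal sketch of how the one-hot encoding simplifies the log-linear potential, the snippet below evaluates its logarithm for integer class indices. The function name, the toy sizes (3 classes, 2 visual features) and all numeric values are illustrative assumptions, not values from the paper.

```python
def log_potential(P, U, b, h_t, prev_label, label):
    """Log of the log-linear potential for integer class indices.

    With one-hot labels, y_{t-1}^T P y_t picks the entry P[prev_label][label],
    h_t^T U y_t is the dot product of h_t with column `label` of U,
    and b^T y_t picks b[label].
    """
    unary = sum(h * u_row[label] for h, u_row in zip(h_t, U))
    return P[prev_label][label] + unary + b[label]


# Toy parameters (assumed): k = 3 classes, 2 visual features.
P = [[0.5, -1.0, 0.0],
     [0.0, 2.0, -0.5],
     [1.0, 0.0, 0.25]]
U = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 2.0]]
b = [0.1, 0.2, 0.3]
h_t = [2.0, -1.0]

score = log_potential(P, U, b, h_t, prev_label=1, label=2)
```

Because the labels are one-hot, no matrix product is needed: each term reduces to a table lookup or a single dot product.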

The rationale for using a deep representation $h(x_{t})$ for the input images is clear: as introduced by Krizhevsky et al. (2012), the immense complexity of the visual object recognition task requires a model with a very large learning capacity. Convolutional layers provide the structure required for learning visual features of the unary input. We aim to craft a suitable structure to learn the pairwise contextual relations as well.

CRF was originally applied to language processing tasks such as Part of Speech (POS) tagging and Named Entity Recognition (NER) (Lafferty et al., 2001). In most applications of CRF to either language or image understanding, there are no more than a few dozen different classes. Our dataset contains nearly a thousand classes and the pairwise potential matrix $P$ has therefore nearly a million parameters. In order to properly learn and generalize the massive variety of possible neighboring patterns, we enforce a structure on the pairwise potential matrix: the goal is to learn neighboring-class embedding in a feature vector space. For this purpose, we define a low-dimensional decomposition of the pairwise potential matrix $P$ as the product of the left-side neighbor embedding matrix $R$ and the class embedding matrix $Q$ :

$P=R^{\top}Q.\qquad(3)$

The columns of $Q$ are low-dimensional embeddings of the target classes, and the columns of $R$ are embeddings of the classes of the left-side object. Assigning the matrix factorization (3) to the CRF potential function (2) we get:

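$\varphi(y_{t},h_{t},y_{t-1})=\exp\left((Ry_{t-1})^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right).\qquad(4)$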
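The following sketch illustrates the effect of the factorization under stated assumptions: with one-hot labels, the pairwise term is a dot product between two embedding columns, and the number of pairwise parameters drops from $k^{2}$ to $2dk$. The helper names and toy matrices are hypothetical; the sizes passed to `parameter_counts` (972 classes, embedding dimension 32) are the ones reported in the implementation details of this paper.

```python
def pairwise_score(R, Q, prev_label, label):
    """(R y_{t-1})^T Q y_t for one-hot labels.

    R and Q are d x k matrices stored as lists of rows; the score is the dot
    product of column `prev_label` of R with column `label` of Q.
    """
    return sum(r_row[prev_label] * q_row[label] for r_row, q_row in zip(R, Q))


def parameter_counts(k, d):
    """Pairwise parameters of the full matrix P versus the factors R and Q."""
    return k * k, 2 * d * k


# Toy embeddings (assumed): d = 2, k = 3.
R = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
Q = [[0.5, 1.0, 0.0],
     [2.0, 0.0, 1.0]]

p_12 = pairwise_score(R, Q, prev_label=1, label=2)  # entry (1, 2) of R^T Q
full, factorized = parameter_counts(k=972, d=32)
```

For these sizes the full matrix has 944,784 entries and the two factors together have 62,208, roughly a fifteenfold reduction.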

Given the values of the model parameters, dynamic programming algorithms can be used for efficient and exact inference. The Viterbi algorithm finds the most probable sequence label assignment, and the Forward-Backward algorithm extracts the marginal probability of each item by summation over all possible assignments (Sutton and McCallum, 2006). The computational complexity of the forward-backward and Viterbi algorithms is quadratic in the number of classes. In the next section we show that the matrix factorization improves classification performance. Note that the matrix factorization also reduces the computational complexity of the dynamic programming algorithms used for the classification procedure: it brings the complexity down to the number of classes multiplied by the factorization dimensionality.
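A minimal sketch of Viterbi decoding for a linear-chain model is given below. It uses the standard recursion, which is quadratic in the number of classes, and does not attempt to exploit the factorization. The function name, the score layout, and the toy scores are assumptions made for illustration, and no start-transition term is modeled for the first position.

```python
def viterbi(log_unary, log_pair):
    """Most probable label sequence of a linear-chain model.

    log_unary[t][c]: unary log-score of class c at position t.
    log_pair[a][c]:  pairwise log-score of class a followed by class c.
    """
    n, k = len(log_unary), len(log_unary[0])
    best = list(log_unary[0])          # best score of a path ending in each class
    back = []                          # back-pointers for positions 1..n-1
    for t in range(1, n):
        pointers, scores = [], []
        for c in range(k):
            a_best = max(range(k), key=lambda a: best[a] + log_pair[a][c])
            pointers.append(a_best)
            scores.append(best[a_best] + log_pair[a_best][c] + log_unary[t][c])
        back.append(pointers)
        best = scores
    label = max(range(k), key=lambda c: best[c])
    path = [label]
    for pointers in reversed(back):    # follow the back-pointers to the start
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Toy example (assumed): the middle object is visually ambiguous, and
# neighbors of the same class are favored by the pairwise scores.
log_unary = [[2.0, 0.0], [0.0, 0.1], [2.0, 0.0]]
log_pair = [[1.0, -1.0], [-1.0, 1.0]]

unary_only = [max(range(2), key=lambda c: row[c]) for row in log_unary]
decoded = viterbi(log_unary, log_pair)
```

In this toy case the per-object decision labels the middle item differently from its neighbors, while the joint decoding assigns all three items to the same class, which is the kind of correction context is meant to provide.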

2.2 Learning the Model’s Parameters

In the training phase we assume the availability of $s$ labeled sequences $(\mathbf{x}_{1},\mathbf{y}_{1}),...,(\mathbf{x}_{s},\mathbf{y}_{s})$ . The likelihood function of the factorized CRF model defined above is:

$L(R,Q,U,b)=\sum_{i=1}^{s}\log p(\mathbf{y}_{i}\mid\mathbf{h}(\mathbf{x}_{i}))\qquad(5)$

where $\mathbf{h}(\mathbf{x}_{i})$ are the feature vectors of the sequence $\mathbf{x}_{i}$ and $i$ goes over the sequences in the training data. The likelihood function can be maximized by applying standard Stochastic Gradient (SG) based methods. Since the CRF underlying graph is loop-free, it is tractable to compute the likelihood function and its gradient using the forward-backward algorithm (Sutton and McCallum, 2006). When there is no low-dimensionality constraint on $P$ , the likelihood is a concave function of the model parameters $P$ , $U$ and $b$ (Sutton and McCallum, 2006) and the optimal parameters can be easily found. The factorization of the pairwise potential matrix $P=R^{{\scriptscriptstyle\top}}Q$ causes the likelihood (5) to be a non-concave function of the model parameters, and therefore there is no guarantee that gradient methods will converge to the global maximum likelihood. Hence, there is no theoretical reason to favor optimizing the exact likelihood over approximate local variants that have better generalization capabilities.
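To make the tractability argument concrete, here is a small sketch of the forward recursion for the log-partition function, checked against brute-force enumeration on a toy chain. All names and numeric scores are illustrative assumptions; the scores stand in for the unary and pairwise log-potentials, and no start-transition term is modeled.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(labels, log_unary, log_pair):
    """Unnormalized log-score: sum of the log-potentials along the chain."""
    score = log_unary[0][labels[0]]
    for t in range(1, len(labels)):
        score += log_pair[labels[t - 1]][labels[t]] + log_unary[t][labels[t]]
    return score


def log_partition(log_unary, log_pair):
    """log Z by the forward recursion: O(n k^2) rather than O(k^n)."""
    k = len(log_unary[0])
    alpha = list(log_unary[0])
    for t in range(1, len(log_unary)):
        alpha = [
            logsumexp([alpha[a] + log_pair[a][c] for a in range(k)]) + log_unary[t][c]
            for c in range(k)
        ]
    return logsumexp(alpha)


def log_likelihood(labels, log_unary, log_pair):
    """Exact sequence-level log-likelihood log p(y | x)."""
    return sequence_score(labels, log_unary, log_pair) - log_partition(log_unary, log_pair)


# Toy scores (assumed): n = 4 positions, k = 3 classes.
log_unary = [[0.2, -0.3, 1.0], [0.0, 0.5, -1.0], [1.5, 0.0, 0.3], [-0.2, 0.4, 0.1]]
log_pair = [[0.5, -1.0, 0.0], [0.0, 2.0, -0.5], [1.0, 0.0, 0.25]]

all_sequences = list(product(range(3), repeat=len(log_unary)))
brute_force = logsumexp([sequence_score(y, log_unary, log_pair) for y in all_sequences])
```

The brute-force sum visits all $3^{4}=81$ label sequences, whereas the recursion needs only a few dozen operations, and the two agree up to floating-point error.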

We next propose a novel learning approach that is based on optimizing an approximated CRF objective function, that can be used as a surrogate likelihood. It also allows incorporating batch-normalization into the training procedure. In the next section we show that this method, which learns to balance the CRF features, significantly improves the classification performance.

Our approximated objective is inspired by the MEMM formulation (McCallum et al., 2000). Linear-chain CRFs were originally introduced as an improvement on the Maximum Entropy Markov model (MEMM), which is essentially a Markov model in which the transition distributions are given by a logistic regression model. CRF and MEMM can be written with the same set of parameters. The main difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next-states given the current-state, whereas the CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation-sequence. The MEMM directed graphical modeling in our case is:

$p(\mathbf{y}\mid\mathbf{h}(\mathbf{x}))=\prod_{t=1}^{n}p(y_{t}\mid h_{t},y_{t-1})\qquad(6)$

where

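$p(y_{t}\mid h_{t},y_{t-1})=\frac{1}{Z(t)}\exp\left(y_{t-1}^{\top}R^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right).\qquad(7)$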

When used for inference, the MEMM suffers from the label bias problem (Kakade et al., 2002; Lafferty et al., 2001), which may lead to a drop in performance in some applications. Here, however, we propose applying the MEMM objective only as a local approximation to learn the parameter set of the linear-chain CRF model, whereas the test-time inference still uses the global normalization of CRF modeling and thus avoids the label bias problem. In the appendix we review standard likelihood approximation strategies for efficient CRF training and show that the training method we use in this study can be viewed as a simplified version of the piecewise-pseudolikelihood approximation (Sutton and McCallum, 2007).

Our objective function is, therefore, defined as the conditional probability of the current-object class, given the class of the left-side neighbor object:

$L_{\text{MEMM}}=\sum_{i=1}^{s}\sum_{t=1}^{n}\log p(y_{i,t}\mid h_{i,t},y_{i,t-1})\qquad(8)$

where $i$ goes over the sequences and $t$ goes over the objects in the sequence, $h_{i,t}$ is the object CNN-based representation, $y_{i,t}$ is the true class label and $p(\cdot)$ is as defined in (7). Fig. 3 illustrates the difference between the exact CRF objective and the MEMM objective we use as a CRF approximation.

The proposed objective (8) does not necessarily eliminate significant contextual information. Rather, when learning the factorized CRF, it may enrich the training dataset, improve the stochastic nature of the SG optimization process and help to prevent overfitting since there are many more object samples than sequence samples, and the mini-batches are composed of adjacent pairs of objects taken from random training samples. In contrast, restricting the mini-batches to contain full sequences, would decrease the model’s freedom to discover better solutions for the objective of pairwise transition parameters. In fact, as we empirically show in the next section, optimizing (8) yields better results than optimizing the exact likelihood (5). Note that the computational complexity of the approximated likelihood (8) is linear in the number of classes and therefore the training process is much faster.
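The local objective can be sketched as a softmax over the classes given the left neighbor's label. The snippet below is an illustrative reading of the locally normalized conditional and its sum over training samples, not the authors' implementation; the function names, the sample layout `(h_t, prev_label, label)`, and all toy values are assumptions.

```python
import math


def local_log_prob(R, Q, U, b, h_t, prev_label, label):
    """log p(y_t | h_t, y_{t-1}): a softmax over the k classes.

    R, Q are d x k embedding matrices and U is an s x k matrix, stored as
    lists of rows. The cost per sample is linear in the number of classes.
    """
    context = [r_row[prev_label] for r_row in R]          # R y_{t-1}
    logits = []
    for c in range(len(b)):
        pair = sum(ctx * q_row[c] for ctx, q_row in zip(context, Q))
        unary = sum(h * u_row[c] for h, u_row in zip(h_t, U))
        logits.append(pair + unary + b[c])
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))  # log Z(t)
    return logits[label] - log_z


def memm_objective(R, Q, U, b, samples):
    """Sum of local log-probabilities over (h_t, prev_label, label) samples."""
    return sum(local_log_prob(R, Q, U, b, h, prev, cur) for h, prev, cur in samples)


# Toy parameters (assumed): d = 2, k = 3, two visual features.
R = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
Q = [[0.5, 1.0, 0.0], [2.0, 0.0, 1.0]]
U = [[1.0, 0.0, -1.0], [0.5, 0.5, 2.0]]
b = [0.1, 0.2, 0.3]
samples = [([2.0, -1.0], 1, 2), ([0.0, 1.0], 2, 0), ([1.0, 1.0], 0, 0)]

objective = memm_objective(R, Q, U, b, samples)
```

Each sample is an adjacent object pair, so a mini-batch can mix pairs from many different sequences, which is what makes per-sample normalization schemes straightforward to apply.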

2.3 Feature Scaling with Batch Normalization

In optimization, feature standardization or whitening is a common procedure that has been shown to speed up convergence (Orr and Müller, 2003). In deep neural networks, whitening the inputs to each layer may also prevent converging into poor local optima. However, training a deep neural network is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, and need to continuously adapt to the new distribution. The batch-normalization (BN) (Ioffe and Szegedy, 2015) method draws its strength from making normalization part of the model architecture and performing the normalization for each training mini-batch.

Our MEMM based training objective function (8) models the current state class conditional distribution by a logistic regression (LR) model. Therefore, Eq. (7) can be written as:

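$\log p(y_{t}\mid h_{t},y_{t-1})=\begin{pmatrix}1 & h_{t}^{\top} & (Ry_{t-1})^{\top}\end{pmatrix}\begin{pmatrix}b^{\top}\\ U\\ Q\end{pmatrix}y_{t}-\log Z(t).\qquad(9)$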

The first term is the input features to the LR and the second term is the LR parameters $Q$ , $U$ and $b$ . The input features are composed of the visual features of the CNN $h_{t}$ and the learned neighbor embeddings $Ry_{t-1}$ . Since $y_{t-1}$ is a one-hot vector, the columns of $R$ are dense representations of the classes. The standardization of the input feature vector $(Ry_{t-1},h_{t})$ is important in order to avoid inherent bias between the local-visual and contextual class information. The goal is to encourage each input feature to have normalized mean and variance. Since the CNN is pre-trained we can compute the mean and variance vectors of the visual features as a pre-processing stage, and use standardized CNN feature vectors ${h}$ for CRF training and inference. In contrast, the context class embeddings $Ry_{t-1}$ are jointly learned with the LR layer and thus are changed during the training process. Hence, we use the batch-normalization (Ioffe and Szegedy, 2015) method to learn their mini-batch normalization during the training process.

The input layer and the LR layer of the MEMM training procedure are illustrated in Fig. 4.

Formally, by applying batch normalization to the context representation $Ry_{t-1}$ , Eq. (7) is replaced by:

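$p(y_{t}\mid h_{t},y_{t-1})=\frac{1}{Z(t)}\exp\left(\text{BN}(Ry_{t-1})^{\top}Qy_{t}+h_{t}^{\top}Uy_{t}+b^{\top}y_{t}\right)$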

where $\text{BN}{(x_{i})}=\dfrac{x_{i}-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}$ with mini-batch mean and variance:

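$\mu_{\mathcal{B}}=\frac{1}{m}\sum_{i=1}^{m}x_{i},\qquad\sigma_{\mathcal{B}}^{2}=\frac{1}{m}\sum_{i=1}^{m}(x_{i}-\mu_{\mathcal{B}})^{2}.$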

This way, the batch-normalization encourages the activations to have standard distributions during training, and tracks the moving averages of normalization parameters for the inference stage. The training method is summarized in Table 1.
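A minimal sketch of the training-time normalization follows, assuming the plain standardization form of BN given above (without learned scale and shift) and an exponential moving average with a hypothetical momentum of 0.9. The mini-batch consists of the context embeddings $Ry_{t-1}$ of the sampled left neighbors; all names and values are illustrative.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature over a mini-batch of vectors.

    Returns the normalized vectors together with the mini-batch mean and
    (biased) variance of each feature.
    """
    m, d = len(batch), len(batch[0])
    mean = [sum(x[j] for x in batch) / m for j in range(d)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / m for j in range(d)]
    normed = [
        [(x[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(d)]
        for x in batch
    ]
    return normed, mean, var


def update_moving_average(average, value, momentum=0.9):
    """Track population statistics for inference (the momentum is an assumption)."""
    return [momentum * a + (1.0 - momentum) * v for a, v in zip(average, value)]


# Toy neighbor embedding matrix R (assumed): d = 2, k = 3.
R = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
prev_labels = [0, 2, 1, 2]                                # left-neighbor classes
batch = [[r_row[p] for r_row in R] for p in prev_labels]  # R y_{t-1} per sample

normed, mean, var = batch_norm(batch)
running_mean = update_moving_average([0.0, 0.0], mean)
```

Because classes that occur more often as left neighbors appear more often in the batch, the statistics are weighted by class frequency rather than computed uniformly over the columns of $R$.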

In the CRF exact likelihood, the weights in each sequence-level sample are shared across multiple locations in varying numbers of appearances, and the potential factors $R,Q$ are jointly used. Hence, batch normalization cannot be applied directly to learn mini-batch statistics, and, as a matter of fact, previous CRF studies (e.g. (Chandra et al., 2017; Wang et al., 2015; Durrett and Klein, 2015)) did not use it. A major advantage of the approximate MEMM likelihood we use is that the potential factors $P=R^{{\scriptscriptstyle\top}}Q$ are used in a sequential order: We first apply the matrix $R$ on the context label and then the matrix $Q$ on the current label (see Eq. 9). This is a natural setup for BN, which allows a very simple and effective way to apply batch-normalization to each neighboring-label sample.

After convergence of the training stage, we apply the batch-normalization inference procedure: Each column in matrix $R$ is standardized by the training population statistics $\mu_{\mathcal{P}}$ and $\sigma_{\mathcal{P}}$ , estimated from the moving averages of mini-batch statistics tracked during training (Ioffe and Szegedy, 2015):

$\text{BN}_{test}(Ry_{i})=\frac{Ry_{i}-\mu_{\mathcal{P}}}{\sqrt{\sigma_{\mathcal{P}}^{2}+\epsilon}}$

At test time, we compute the standardized CNN representation vector ${h}$ for each object in the sequence, and classify the objects using the forward-backward algorithm as described above. The inference procedure is summarized in Table 1.
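The test-time step can be sketched as follows, assuming the same standardization form as the BN definition above (division by $\sqrt{\sigma^{2}+\epsilon}$): each class column of $R$ is standardized with fixed population statistics, and the resulting pairwise score matrix is then available to the dynamic programming inference. The function names and toy values are hypothetical, and `eps=0.0` is used only to keep the toy arithmetic exact.

```python
import math


def standardize_columns(R, mu, var, eps=1e-5):
    """Apply fixed population statistics to every class column of R (d x k)."""
    k = len(R[0])
    return [
        [(R[j][c] - mu[j]) / math.sqrt(var[j] + eps) for c in range(k)]
        for j in range(len(R))
    ]


def pairwise_matrix(R_std, Q):
    """Pairwise log-scores R_std^T Q as a k x k table for inference."""
    d, k = len(Q), len(Q[0])
    return [
        [sum(R_std[j][a] * Q[j][c] for j in range(d)) for c in range(k)]
        for a in range(k)
    ]


# Toy values (assumed): d = 2, k = 2.
R = [[1.0, 3.0], [0.0, 4.0]]
Q = [[1.0, 0.0], [0.0, 2.0]]
mu, var = [2.0, 2.0], [1.0, 4.0]

R_std = standardize_columns(R, mu, var, eps=0.0)
P_test = pairwise_matrix(R_std, Q)
```

Since the statistics are fixed after training, this standardization can be computed once and reused for every test sequence.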

3 Experiments

3.1 The Dataset

Our dataset contains sequences of fixed-size image patches, originating from in-situ photos of retail store displays taken in supermarkets and grocery stores. The objects are the inventory items positioned at the front of the displays, and the classes are their stock-keeping-unit (SKU) unique identifiers. Each object was originally annotated by its class label and bounding-box coordinates. The image patches were cropped and reshaped into single-object images of size $150\times 450$ pixels, and grouped into shelves; i.e., sequences of horizontal layouts, arranged from left to right. The benchmark contains 76,081 sequences of 460,121 single-object images, originating from 24,024 photos of store displays. Each object is labeled as one of 972 different classes. Sequence lengths vary from 2 to 32, and are typically between 4 and 12. The average sequence length is $6$ and the length standard deviation is $2.4$ . To perform 5-fold cross-validation, we split the dataset into 5 splits of 80% training and 20% testing.
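As a quick sanity check on the reported figures, the counts above can be combined into per-sequence and per-photo averages. The variable names are illustrative; only the three counts come from the text.

```python
num_sequences = 76_081
num_objects = 460_121
num_photos = 24_024

mean_sequence_length = num_objects / num_sequences
sequences_per_photo = num_sequences / num_photos
objects_per_photo = num_objects / num_photos

print(f"objects per sequence: {mean_sequence_length:.2f}")
print(f"sequences per photo:  {sequences_per_photo:.2f}")
print(f"objects per photo:    {objects_per_photo:.2f}")
```

The ratio of objects to sequences comes out at roughly 6.05, which agrees with the stated average sequence length of 6.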

Many groups of classes belong to the same archetype, and only differ in terms of minor details such as volume, flavor, nutrient content etc. They often share similar visual features, which makes appearance-based classification very difficult. On the other hand, the object layout behavior is very coherent: it is dictated by the supplier planograms (specified product layouts) and extracted from the image realograms (observed product layouts). Although realograms are non-deterministic by nature, consistent semantic patterns are frequently spotted. Class transition behavior may be discovered, revealing tendencies of pairs to appear as left-to-right neighbors, and individual classes to appear multiple times successively. The unique challenges we faced in our task are derived from the large number of visually similar classes, which co-occur in distinct structures in large-scale images.

3.2 Implementation Details

We first train a ResNet50 CNN (He et al., 2016) from scratch to compute the hidden representation vector ${h}_{s\times 1}$ for each image-patch. In our implementation the hidden layer size (after global average pooling) was $s=2048$ . Then, as a preprocessing step for the CRF model, we calculate the mean and standard deviation of each feature of the hidden representation vector from the training dataset: ${\mu}_{s\times 1},{\sigma}_{s\times 1}$ .

The number of classes in our dataset is $m=972$ , and the class embedding dimensionality we use is $d=32$ . We learn a class embedding matrix $Q_{d\times m}$ , a neighbor embedding matrix $R_{d\times m}$ , a unary potential matrix $U_{s\times m}$ and a bias vector ${b}_{m\times 1}$ . We initialize the bias parameter to [math] and the weight parameters with random Gaussian samples $\mathcal{N}(0,0.01)$ for symmetry breaking. We train the network as described in Section 2, using SG with mini-batches of size 128, and maximizing the log-likelihood function (8) with an $l_{2}$ regularization factor $\lambda=5\cdot 10^{-4}$ for all network parameters. The training samples in each mini-batch are object-pairs selected randomly from the benchmark.
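To make the parameter shapes concrete, here is a small NumPy sketch that allocates the four parameter groups and compares the size of the factorized pairwise term with a full $m\times m$ matrix. Two points are assumptions: the value $0.01$ in $\mathcal{N}(0,0.01)$ is treated as a standard deviation (it may equally denote a variance), and the bias initial value must be supplied by the caller, since it is not recoverable from the text above. The function name is hypothetical.

```python
import numpy as np

def init_params(bias_init, s=2048, m=972, d=32, std=0.01, seed=0):
    """Allocate the CRF parameters with small Gaussian weights.

    bias_init is required because the source value is not available here;
    std=0.01 is an assumed reading of N(0, 0.01).
    """
    rng = np.random.default_rng(seed)
    return {
        "Q": rng.normal(0.0, std, size=(d, m)),   # class embedding
        "R": rng.normal(0.0, std, size=(d, m)),   # neighbor embedding
        "U": rng.normal(0.0, std, size=(s, m)),   # unary potentials
        "b": np.full(m, float(bias_init)),        # bias vector
    }

m, d = 972, 32
factorized_pairwise = 2 * d * m   # entries in Q and R together
full_pairwise = m * m             # entries in an unfactorized P
```

With these sizes the factorized pairwise term has 62,208 parameters, against 944,784 for a full pairwise matrix, roughly a 15-fold reduction.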

Runtimes were measured on the same machine using an Intel(R) Core(TM) i7-5930K CPU @3.50GHz and a GeForce GTX Titan X GPU. A single batch epoch of the baseline unary system took 46 sec, a global optimization algorithm took 780 sec and our local optimization took 47 sec. The local training procedure is more efficient than computing the global maximum likelihood, because its time complexity is linear in the number of classes, whereas the global training procedure is quadratic in the number of classes. When using the factorized pairwise matrix, the global training time complexity can be reduced to the number of classes times the embedding dimensionality. The most important contribution of the approximate likelihood is hence its effect on performance, owing to its ability to add batch-normalization to the nonlinear objective. At test time it took less than 0.1 seconds to classify all the objects in a single image.

3.3 Comparative experimental results

In order to validate the performance of our method we implemented several alternatives. They are all based on the same contextless CNN local information and only differ in the way they learn the object contextual information from the training dataset and integrate the context model with the local CNN. Below is a list of the baseline models we implemented.

Unary The baseline comparison model is the original CNN we trained without any contextual information.

Pairwise Statistics Based on the work of Chu and Cai (2018), we created a CRF model in which the unary potentials are taken from the CNN classifier prediction results, and the pairwise potentials are the pairwise statistics $P_{ij}=p(j|i)=p(y_{t}=j|y_{t-1}=i)$ estimated from the training dataset. In other words, the context information is modeled by a stationary first-order Markov chain. No additional NN training is applied. The only parameter we need to set is the relative weight of the unary and pairwise potentials. This weight, which adjusts the tradeoff between the local appearance and the contextual information, was selected via cross-validation.
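A minimal sketch of this baseline estimates the transition statistics by counting left-to-right neighbor pairs and then builds log-potentials for a linear-chain model. Two choices here are assumptions rather than details from the text: additive smoothing is used to avoid zero probabilities, and the relative weight `w` multiplies the pairwise log-potential. The function names are hypothetical.

```python
import numpy as np

def transition_matrix(sequences, m, smoothing=1.0):
    """Estimate P[i, j] = p(y_t = j | y_{t-1} = i) from label sequences."""
    counts = np.full((m, m), float(smoothing))   # additive smoothing (assumed)
    for seq in sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def combined_log_potentials(cnn_probs, P, w):
    """Unary log-potentials from CNN posteriors, weighted pairwise log-potentials."""
    return np.log(cnn_probs), w * np.log(P)
```

The two returned arrays can be fed to any linear-chain inference routine; setting `w = 0` recovers the contextless unary prediction.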

Recurrent Neural Network Another modeling option for a sequence estimation is the Bi-Directional Recurrent Neural Network (Schuster and Paliwal, 1997) with LSTM (Hochreiter and Schmidhuber, 1997) as memory block (BiLSTM). In this approach we compute the posterior distribution of the current object label based on all the visual information provided by the CNN: $p(y_{t}|\mathbf{x})=p(y_{t}|x_{1},...,x_{n})$ . The BiLSTM architecture learns a context vector $c_{t}$ for each object, which encapsulates the bidirectional information in the sequence input observations transferred from the CNN output $h_{1},...,h_{n}$ , and learns a softmax prediction $p(y_{t}|c_{t})$ for each object label. This approach, however, did not outperform the original contextless CNN. The most important information, in addition to the object local appearance, is the label relations between neighboring objects which are not captured here; the BiLSTM network uses a softmax output layer that provides a separate prediction for each class and thus ignores class similarities. Although there is some structural similarity between RNN techniques and our local likelihood approach for CRF, the underlying probabilistic model is very different. Hence, CRFs are preferable for the element-wise classification of observed sequences because (a) they can explicitly learn second-order class similarity which is often the dominant source of contextual information, and (b) the Markovity assumption provides an optimal solver over the entire sequence.

Log-linear CRF This method learns the log-linear parameters of the linear-chain CRF (2). We implemented both the exact and approximate likelihood training methods and tried both $l_{1}$ and $l_{2}$ regularizations for the pairwise potential matrix. We also tried whitening the one-hot input vectors. The results provided only a minor improvement over the baseline contextless classifier.

Factorized CRF This method learns the factorized parameters of the pairwise weight matrix as defined in Eq. (4).

Approximate Factorized CRF + BN This is the model proposed in this study: The CRF pairwise weight matrix is factorized as defined in Eq. (4), the network is trained as described in subsection 2.2, using the surrogate likelihood (8), with $l_{2}$ weight regularization, and batch-normalization for the embedding features.

Quantitative Results: Table 2 lists the results in terms of model error rate, indicates the incremental improvement in accuracy over model variations, and shows that the non-linear method based on batch-normalized class embedding yields significantly better results than the other alternatives. Figure 5 depicts the Precision-Recall curve measured for the different methods by applying different confidence thresholds. It is particularly useful for our original objective, which aimed to maximize recall while preserving high (90%+) precision. It shows that the MEMM based training method with batch normalization achieved significantly higher recall than the alternatives. Our method achieved a recall of $85.75\%$ , whereas the unary baseline recall was $79.97\%$ , and the second best alternative was $82.46\%$ - all preserving the same $90\%$ precision. It is worth pointing out that our test set is considerably large, which means that we correctly identified around 3,000 more objects than the unary model, and 2,000 more objects than the second best alternative.
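The recall-at-fixed-precision figures can be computed by sweeping a confidence threshold over the predictions. The sketch below assumes a common convention that the text does not spell out: predictions below the threshold are rejected, precision is the fraction of accepted predictions that are correct, and recall is the fraction of all objects that are both accepted and correct. The function name is hypothetical.

```python
import numpy as np

def recall_at_precision(confidence, correct, target_precision=0.9):
    """Highest recall over all thresholds whose precision meets the target.

    confidence: (N,) score of the predicted label per object.
    correct: (N,) 1 if the predicted label is right, else 0.
    """
    order = np.argsort(-np.asarray(confidence))       # most confident first
    hits = np.cumsum(np.asarray(correct)[order])      # correct among accepted
    accepted = np.arange(1, len(order) + 1)
    precision = hits / accepted
    recall = hits / len(order)
    feasible = precision >= target_precision
    return float(recall[feasible].max()) if feasible.any() else 0.0
```

Each prefix of the confidence-sorted list corresponds to one threshold, so a single sort covers the whole Precision-Recall curve.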

3.4 Ablation Study

We performed an ablation analysis aimed at isolating the effect of the various innovations suggested. Each experiment uses the same configuration as in our method with only one alteration.

Feature Scaling We tested the following variants: removing the batch-normalization layer entirely, removing the whitening of the CNN activations, and whitening the one-hot vectors at the input of the embedding layer instead of batch-normalizing its output. In each case the results became much worse and were comparable to the contextless unary network. On the other hand, when adding or removing the scale and shift from the BN parameters, the results remained comparable to our best results. This suggests that the BN layer has an enormous impact since it whitens the embedding activations during training, similar to the whitening applied for the CNN activations.

Regularization We tested $l_{1}$ , $l_{2}$ or no regularization for the embedding weights. The results were significantly better with $l_{2}$ regularization, which encourages all the weights for each class in the embedding space to be used in training.

Pairwise Matrix Factorization We considered other variants of the class embedding concept in which the embedding parameters of the target and neighboring labels are tied. For that purpose, we impose the structure of the embedding matrix $R$ on the current class as well as the neighboring class. The pairwise potential in this case is factorized as $P=R^{{\scriptscriptstyle\top}}DR$ to get the same embedding for the class and its neighbor. We may also apply the class embedding on the unary potentials matrix by factorizing $U=V^{{\scriptscriptstyle\top}}R$ . In these parametrizations, applying the embedding-batch-norm requires parameter tying between the softmax inputs and the softmax weights, and thus compromises the effectiveness of the batch normalization process.

Surrogate Likelihood Variants Global optimization of LC-CRF is not only much more time-consuming, but also lacks the ability to apply a straightforward batch normalization strategy, since the activations are shared in multiple locations in each sample in the mini-batch. The same problem appears in other known methods of local likelihood approximation such as piecewise, pseudolikelihood and Piecewise-Pseudolikelihood (PWPL), which are close variants of our local training model (see details in Appendix 6.1). Applying an embedding-batch-norm to the pseudolikelihood or PWPL methods would once again require parameter tying between the softmax inputs and the softmax weights. However, the PWPL in our case can be reduced to the form of a forward term which is equivalent to the MEMM-like objective (7) and an additive backwards term which is independent of the CRF input. Hence, the MEMM-like objective function is theoretically highly related to PWPL. In the appendix we explain how our surrogate likelihood can be viewed as a simplified form of the PWPL objective.

Different CNNs We tested our model with various CNN architectures: ResNet50 showed minor improvement in unary CNN performance over AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014), and comparable improvement when incorporating our context-aware methods. Note that local visual information is often insufficient in our data, hence adding context is very helpful even with very strong CNNs.

Different data We cross-validated our methods on 5 different train-test splits and obtained comparable results and small variances for the different mixes. We also verified our method on an OCR dataset of handwritten words, as described in Section 3.5.

Increased Learning Capacity We tried increasing the model’s non-linearity by adding another fully connected layer and nonlinear ReLU between the one-hot vector input and the fully connected embedding layer. We also tried learning the embedding in a higher dimensional space. These enhancements, however, did not improve performance, and turned out to be redundant.

Similarity Networks An additional noteworthy approach to identifying visually similar classes involves using an architecture which receives multiple samples as training input and compares pairs of samples in order to better discriminate between classes based on their visual features. These methods include Siamese Network (Chopra et al., 2005; Bromley et al., 1994) and other variants e.g. (Kim et al., 2019; Hoffer and Ailon, 2015; Zagoruyko and Komodakis, 2015).

In our case, however, such approaches were unhelpful due to the limited priors and the ultrafine-grained nature of our dataset. For example, object volume or flavor are often visually unmeasurable. Furthermore, these approaches ignore class neighbor information which is usually the dominant source of contextual information in observed sequences. Fig. 6 exemplifies the type of difficulties we faced when trying to learn pairwise visual similarities in our data.

3.5 MIT OCR Dataset of Handwritten Words

We tested our method on data from another benchmark, to validate that our approach generalizes well to other domains beyond store shelves and retail objects. Despite the absence of other fine-grained structured classification datasets of a similar scale, the MIT OCR Dataset of Handwritten Words (Kassel, 2017) can provide a close approximation. The dataset contains 6877 words, composed of 52152 samples of lowercase letters collected from 150 human subjects. Each sample is a $16\times 8$ binary pixel image. We split the dataset into 5512 training words and 1365 testing words. The input features of the CRF are the original 128 pixels. There are only 26 classes in the dataset - the English alphabet letters. Therefore, we used a smaller embedding dimension of 16. We performed the following experiments: Unary: prediction of the samples directly from the input without context; Log-linear CRF: learning the log-linear parameters of the linear-chain CRF; Factorized CRF: learning the factorized parameters of the linear-chain CRF; and Factorized CRF + BN: learning the factorized parameters while whitening the high level representation of the input image and neighbor embedding with BN. The CRF models were trained with the approximate likelihood architecture presented in this work. For each experiment, we report the accuracy (Acc), average precision (AP), and the recall values attained for precision of $0.7$ (RP=0.7) and $0.9$ (RP=0.9). The results are shown in Table 3. Despite the small scale of the data and the small number of classes, our method yielded more accurate results than the alternatives. In particular, the batch normalization procedure applied to the context letter representation was found to be beneficial.

4 Class Embedding Analysis

As a byproduct of the classification model we also obtain a low-dimensional embedding of the different classes. Each column of the neighbor embedding matrix $R$ is a vector representation of the corresponding class. We measure the similarity between classes by a common metric, the cosine of the angle between their vector representations. Fig. 7 shows several examples of an object class and its most similar classes. We can see that this similarity does not reflect visual appearance similarity, e.g. in the second example the similar classes have very different colors. This situation has been studied extensively for the linguistic problem of word embedding. The goal of word embedding algorithms is to represent similar words by similar vectors. It is often useful to distinguish between two kinds of similarity or association between words (Schütze and Pedersen, 1993). Two words are said to have first-order co-occurrence if they typically appear near each other (e.g. wrote is a first-order associate of book or poem). Two words have second-order co-occurrence if they have similar neighbors (e.g. wrote is a second-order associate of words like said or remarked). Second-order word similarity is thus expected to capture a semantic meaning and measure the extent to which the two classes are replaceable based on their tendencies to appear in similar contexts. In Fig. 7 we show that object class embedding captures second-order information. Proximity here corresponds to the mutual tendency to have similar neighbors. We can see in the figure that similar classes, although looking visually different, represent products of similar container-types, volumes and brands.
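Retrieving the most similar classes for a query class reduces to a cosine-similarity search over the columns of $R$. The following NumPy sketch shows one way to do this; the function name and the choice of `k` are illustrative and not taken from the paper.

```python
import numpy as np

def most_similar_classes(R, query, k=3):
    """Return the k classes whose embeddings have the highest cosine
    similarity to the embedding of class `query`.

    R: (d, m) neighbor embedding matrix, one column per class.
    """
    cols = R / np.linalg.norm(R, axis=0, keepdims=True)   # unit-length columns
    sims = cols.T @ cols[:, query]                        # (m,) cosine similarities
    sims[query] = -np.inf                                 # exclude the query itself
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```

Because the columns are normalized first, the ranking depends only on the direction of each embedding and not on its norm.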

Visual similarity and second-order semantic similarity are based on two profoundly different criteria, and may be uncorrelated or even have a negative correlation in some cases as we demonstrate in Fig. 8: classes are visually close when it is easy to confuse them based on their visual appearance, and are semantically close when it is statistically reasonable to switch one for the other on a shelf (i.e. “synonymous” classes). The rows in Fig. 8 contain classes that are visually close but semantically far; i.e., they look alike but tend to appear in different contexts, whereas the columns contain classes which are semantically close but visually far; i.e., they look different, but tend to appear in similar contexts. The examples from the retail world refer to classes of similar brands but with different form-factors or volumes, which tend to appear in different displays in stores. A speech analogy would be comparing homophones (e.g. meet vs. meat, sale vs. sail) with synonyms (e.g. big vs. large, fast vs. quick).

It is hence clear why these two types of similarity contribute two different types of information, and need to be used jointly for the task of object classification. The visual similarity is relevant for the visual image information whereas the class similarity in the embedded space is relevant for the contextual information.

5 Conclusions

We introduced a novel technique to learn deep contextual and visual features for fine-grained structured prediction of object categories, and tested it on a dataset that contains spatial sequences of objects, and a large number of visually similar classes.

Our model clearly outperforms all the other tested models. This architecture appears to be the most straightforward generalization of a context-less classifier to become context-dependent when both the input and the context data require a large learning capacity: the network learns deep feature vectors for neighboring classes, analogously to the learned deep input representations. The Markovity and stationarity assumptions make it sufficient to train with individual objects as samples to enrich the training data diversity, allow for simple embedding batch normalization, and boost the non-convex optimization process both in terms of time and performance.

6 Appendices

6.1 Local Likelihood Approximation

In this appendix we show how the objective function that we used for optimization is related to previously suggested approaches. Pseudolikelihood (Besag, 1975) is a classical approximation of the CRF likelihood function that simultaneously classifies each node given its neighbors in the graph. The pseudolikelihood objective function only depends on the object and its Markov blanket. The pseudolikelihood of our model (1) is:

[TABLE]

where $p(y_{t}|y_{t-1},y_{t+1},\mathbf{x})$ is

[TABLE]
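For illustration, the node conditional and the resulting log-pseudolikelihood can be sketched in NumPy as follows. The sketch assumes that the CRF score of Eq. (1) is a sum of per-node unary terms and per-edge pairwise terms, and that the first and last nodes simply have one neighbor fewer; the function names are hypothetical.

```python
import numpy as np

def node_conditional(unary_t, pairwise, y_prev=None, y_next=None):
    """p(y_t | y_{t-1}, y_{t+1}, x) for one node of a linear chain.

    unary_t: (m,) unary log-potentials of node t.
    pairwise: (m, m) log-potentials, pairwise[i, j] for y_{t-1}=i, y_t=j.
    """
    score = unary_t.copy()
    if y_prev is not None:
        score = score + pairwise[y_prev, :]   # edge from the left neighbor
    if y_next is not None:
        score = score + pairwise[:, y_next]   # edge to the right neighbor
    score = score - score.max()               # numerical stability
    p = np.exp(score)
    return p / p.sum()

def log_pseudolikelihood(unary, pairwise, labels):
    """Sum of log node conditionals given the observed neighbor labels."""
    n = len(labels)
    total = 0.0
    for t in range(n):
        y_prev = labels[t - 1] if t > 0 else None
        y_next = labels[t + 1] if t < n - 1 else None
        total += np.log(node_conditional(unary[t], pairwise, y_prev, y_next)[labels[t]])
    return total
```

Each conditional normalizes over the $m$ values of a single node, so the cost per node is linear in the number of classes.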

Piecewise training (Sutton and McCallum, 2005) is a heuristic method to predict the graph factors from separate “pieces” of the graph. The piecewise objective function is equivalent to the likelihood function of a node-split graph (Sutton and McCallum, 2007), which contains all the single-factor components split from the original graph. Using the CRF notation in Eq. (1), the piecewise likelihood approximation in our case is:

[TABLE]

Note that due to the term in the denominator, computing the piecewise likelihood is quadratic in the number of classes. Piecewise Pseudolikelihood (PWPL) is the standard pseudolikelihood applied to the node-split graph. Its computation is efficient because the objective function is simply the sum of local conditional probabilities. In our case, applying the pseudolikelihood approach on the piecewise objective (13) would give us the following PWPL form:

[TABLE]

Sutton and McCallum (2007) showed that in many cases the PWPL has better accuracy than standard pseudolikelihood, and in some scenarios has nearly equivalent performance to piecewise approximation and even to global maximum likelihood. The first term inside the $\log$ function is equivalent to the forward MEMM objective function (7) that we used. The second term can be written in the form:

[TABLE]

This term is independent of the CRF visual input. The PWPL approximation can be thus expressed as:

[TABLE]
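The independence of the backward term from the visual input can be checked numerically. The sketch below assumes that each piece of the node-split graph carries the score $u_{t}(y_{t})+P_{y_{t-1},y_{t}}$, i.e. the pairwise factor together with the unary term of the current node; this is an assumed reading, and the function names are hypothetical. Under that assumption, normalizing over $y_{t}$ gives the forward MEMM-like term, while normalizing over $y_{t-1}$ cancels the unary term.

```python
import numpy as np

def log_softmax(z, axis):
    """Log-softmax along one axis, shifted by the maximum for stability."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def pwpl_terms(unary_t, pairwise, y_prev, y_t):
    """Forward and backward log-conditionals of one piece.

    Piece score (assumed): f[i, j] = unary_t[j] + pairwise[i, j].
    """
    f = unary_t[None, :] + pairwise
    forward = log_softmax(f, axis=1)[y_prev, y_t]    # log p(y_t | y_{t-1}, x_t)
    backward = log_softmax(f, axis=0)[y_prev, y_t]   # log p(y_{t-1} | y_t, x_t)
    return forward, backward
```

Since `unary_t[j]` is constant along axis 0, it drops out of the backward normalization, which leaves a term that depends only on the pairwise parameters.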

Hence the MEMM-like objective function we used (6) can be viewed as a simplified version of the piecewise-pseudolikelihood objective (14) that was found to be the preferred likelihood approximation for language processing tasks (Sutton and McCallum, 2007).

Bibliography

1. Barnea, E., Ben-Shahar, O., 2019. Exploring the bounds of the utility of context for object detection, in: CVPR.
2. Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828.
3. Besag, J., 1975. Statistical analysis of non-lattice data. The Statistician, 179–195.
4. Bromley, J., Guyon, I., Le Cun, Y., Säckinger, E., Shah, R., 1994. Signature verification using a "siamese" time delay neural network, in: NIPS.
5. Chandra, S., Usunier, N., Kokkinos, I., 2017. Dense and low-rank Gaussian CRFs using deep embeddings, in: ICCV.
6. Chen, G., Li, Y., Srihari, S.N., 2016. Word recognition with deep conditional random fields, in: ICIP.
7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
8. Chen, L.C., Schwing, A., Yuille, A., Urtasun, R., 2015. Learning deep structured models, in: ICML.