Deep Hierarchical Machine: a Flexible Divide-and-Conquer Architecture

Shichao Li; Xin Yang; Tim Cheng

arXiv:1812.00647·cs.CV·December 4, 2018


TL;DR

The paper introduces Deep Hierarchical Machine (DHM), a flexible divide-and-conquer deep model with probabilistic routing and pruning, optimized for classification and regression tasks with sparse feature extraction.

Contribution

It presents a novel DHM architecture that combines stochastic routing with probabilistic pruning and sparse convolution, enhancing efficiency and flexibility over existing models.

Findings

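1. After probabilistic pruning, DHM keeps its MNIST test accuracy close to the unpruned value (about 0.985 vs. 0.986) while the number of multiplications drops by more than an order of magnitude.

2. On face alignment, DHM reaches a slightly lower error than NDF (0.0628 vs. 0.0643) and, unlike NDF, supports pruning.

3. Sparse (local binary) convolution lowers the number of multiplications (8.05M vs. 20.9M for DHM (separated) without pruning) at some cost in accuracy.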


Tables

Table 1: The test accuracy of different models before and after probabilistic pruning.

Method	Accuracy	After Pruning
NDF	0.9896 $\pm$ 0.0023	Not Able
DHM (separated)	0.9860 $\pm$ 0.0010	0.9853 $\pm$ 0.0010
DHM (connected)	0.9861 $\pm$ 0.0019	0.9856 $\pm$ 0.0020

Table 2: The number of multiplications (NOM) before and after probabilistic pruning.

Method	NOM	After Pruning
NDF	3.08M	Not Able
DHM (separated)	20.9M	1.14M
DHM (connected)	15.1M	1.25M

Table 3: The number of multiplications (NOM) and test accuracy of DSHM with different sparsity levels. The NOM here does not account for PP and should be compared with DHM without PP.

Sparsity level	NOM	Accuracy
0.3	8.05M	0.9719 $\pm$ 0.0011
0.5	8.05M	0.9725 $\pm$ 0.0021
0.7	8.05M	0.9709 $\pm$ 0.0024

Table 4: Comparison of the traditional NDF architecture and our DHM on the regression task. Numbers in parentheses give the results with PP, where only one computation path was taken.

Method	NOM (After Pruning)	Error (After Pruning)
NDF	254M (Not able)	0.0643 (Not able)
DHM	228M (35.6M)	0.0628 (0.06382)

Equations

$$0\leq\mathbf{s}_{i,s}(j)\leq 1,\qquad \sum_{j=1}^{|\mathcal{K}_{i,s}|}\mathbf{s}_{i,s}(j)=1 \tag{1}$$

$$\mathbf{P}=\sum_{(i,s)\in\mathcal{N}_{c}}w_{i,s}\,\mathbf{p}_{i,s} \tag{2}$$

$$w_{i,s}=\prod_{m=1}^{s}\mathbf{s}_{i_{m},s_{m}}(j_{m}) \tag{3}$$

$$\mathbb{P}(y_{i}|\mathbf{x}_{i})=\sum_{(i,s)\in\mathcal{N}_{c}}\prod_{m=1}^{s}\mathbf{s}_{i_{m},s_{m}}(j_{m})\,\mathbf{p}_{i,s}(y_{i}) \tag{4}$$

$$L(\mathbb{D})=-\sum_{i=1}^{N}\log\big(\mathbb{P}(y_{i}|\mathbf{x}_{i})\big) \tag{5}$$

$$\mathbb{P}(\mathcal{C}_{i}|\mathbf{x})=\prod_{\mathcal{D}_{j}\in\mathcal{N}_{d}}s_{j}^{\mathbbm{1}(\mathcal{C}_{i}\in\mathcal{D}_{j}^{l})}(1-s_{j})^{\mathbbm{1}(\mathcal{C}_{i}\in\mathcal{D}_{j}^{r})} \tag{6}$$

$$\frac{\partial L(\mathbb{D})}{\partial s_{i}}=-\sum_{t=1}^{N}\left(\frac{\sum_{\mathcal{C}_{j}\in\mathcal{D}_{i}^{l}}\mathbf{p}_{j}(y_{t})\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{t})}{s_{i}\,\mathbb{P}(y_{t}|\mathbf{x}_{t})}-\frac{\sum_{\mathcal{C}_{j}\in\mathcal{D}_{i}^{r}}\mathbf{p}_{j}(y_{t})\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{t})}{(1-s_{i})\,\mathbb{P}(y_{t}|\mathbf{x}_{t})}\right) \tag{7}$$

$$\mathbf{p}_{j}^{t+1}(y)=\frac{1}{Q_{j}^{t}}\sum_{i=1}^{N}\frac{\mathbbm{1}(y_{i}=y)\,\mathbf{p}_{j}^{t}(y_{i})\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{i})}{\mathbb{P}(y_{i}|\mathbf{x}_{i})} \tag{8}$$

$$\mathbf{P}_{i}=\sum_{(i,s)\in\mathcal{N}_{c}}\prod_{m=1}^{s}\mathbf{s}_{i_{m},s_{m}}(j_{m})\,\mathbf{p}_{i,s} \tag{9}$$

$$L(\mathbb{D})=\frac{1}{2}\sum_{i=1}^{N}\|\mathbf{P}_{i}-\mathbf{y}_{i}\|^{2} \tag{10}$$

$$\frac{\partial L(\mathbb{D})}{\partial s_{i}}=\sum_{t=1}^{N}(\mathbf{P}_{t}-\mathbf{y}_{t})^{T}\left(\frac{\mathbf{A}_{l}}{s_{i}}-\frac{\mathbf{A}_{r}}{1-s_{i}}\right) \tag{11}$$

$$\mathbf{p}_{j}^{t+1}=\frac{\sum_{i=1}^{N}\mathbf{y}_{i}\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{i})}{\sum_{i=1}^{N}\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{i})} \tag{12}$$


Taxonomy

Topics: Face recognition and analysis · Face and Expression Recognition · Advanced Image and Video Retrieval Techniques

Methods: Pruning · Convolution

Full text

Deep Hierarchical Machine: a Flexible Divide-and-Conquer Architecture

Shichao Li

Hong Kong University of Science and Technology

Xin Yang

Huazhong University of Science and Technology

Tim Cheng

Hong Kong University of Science and Technology

Abstract

We propose Deep Hierarchical Machine (DHM), a model inspired by the divide-and-conquer strategy that emphasizes representation learning ability and flexibility. A stochastic routing framework as used by recent deep neural decision/regression forests is incorporated, but we remove the need to evaluate unnecessary computation paths by utilizing a different topology and introducing a probabilistic pruning technique. We also present a specialized version of DHM (DSHM) for efficiency, which inherits the sparse feature extraction process of traditional decision trees with pixel-difference features. To achieve sparse feature extraction, we propose to utilize sparse convolution operations in DSHM and show one possibility of introducing sparse convolution kernels by using local binary convolution layers. DHM can be applied to both classification and regression problems, and we validate it on standard image classification and face alignment tasks to show its advantages over past architectures.

1 Introduction

Divide-and-conquer is a widely-adopted problem-solving philosophy which has been demonstrated to be successful in many computer vision tasks, e.g. object detection and tracking [9] [21]. Instead of solving a complete and huge problem, divide-and-conquer suggests decomposing the problem into several sub-problems and solving them in different constrained contexts. Figure 1 illustrates this idea with a binary classification problem. Finding a decision boundary in the original problem space is difficult and leads to a sophisticated nonlinear model, but linear decision models could be more easily obtained when solving the sub-problems.

The traditional decision tree, which splits the input feature space at each splitting node and gives the prediction at a leaf node, inherently uses the divide-and-conquer strategy as an inductive bias. The designs of input features and splitting functions are key to the success of this model. Conventional methods usually employ hand-crafted features such as the pixel-difference feature [10, 7, 14, 23] and the Haar-like feature [24]. However, the input space for vision tasks is usually high-dimensional and often leads to a huge pool of candidate features and splitting functions that is impractical to evaluate exhaustively. In practice, the huge candidate pool is randomly sampled to form a small candidate set of splitting functions, and a local greedy heuristic such as entropy minimization is adopted to choose the “best” splitting function, i.e. the one that maximizes data “purity”. This limits the representation learning ability of the traditional decision tree.

Deep neural decision forests [8] were proposed to give decision trees deep representation learning ability. In [8], the outputs of the last fully connected layer of a CNN are utilized as stochastic splitting functions. A global loss function is differentiable with respect to the network parameters in this framework, enabling greater representation learning ability than the local greedy heuristics in conventional decision trees. Deep regression forests [19] were later proposed for regression problems based on the general framework of [8]. However, the success in introducing representation learning ability comes at the price of transforming decision trees into stochastic trees which make “soft” decisions at each splitting node. As a result, all splitting functions have to be evaluated because every leaf node contributes to the final prediction, yielding a significant time cost. Pruning branches that contribute little to the final prediction should effectively reduce the computational cost with little accuracy degradation. Unfortunately, the network topology used in previous works [8, 19] requires a complete forward pass of the entire CNN to compute the routing probability for each splitting node, making pruning impractical.

A major advantage of the divide-and-conquer strategy (e.g. random forests) is its high efficiency in many time-constrained vision tasks such as face detection and face alignment. Simple and ultrafast-to-compute features such as pixel difference extract only sparse information (e.g. two pixels) from the image space. However, existing deep neural decision/regression forests [8, 19] completely ignore the computational complexity of splitting nodes, which greatly limits their efficiency.

In this work, we propose a general tree-like model architecture, named Deep Hierarchical Machine (DHM), which utilizes a flexible model topology to decouple the evaluation of splitting nodes and a probabilistic pruning strategy to avoid the evaluation of unnecessary paths. For the splitting nodes, we also explore the feasibility of inheriting the sparse feature extraction process (i.e. the pixel-difference feature) of traditional random forests and design a deep sparse hierarchical machine (DSHM) for high efficiency. We evaluate our method on standard image classification and facial landmark coordinate regression tasks and show its effectiveness. Our implementation can be easily incorporated into any deep learning framework, and the source code and pre-trained models will be made available on a website (the address is currently unavailable). In summary, our contributions are:

1. We propose Deep Hierarchical Machine (DHM) with a flexible model topology and a probabilistic pruning strategy to avoid evaluating unnecessary paths. DHM provides a unified framework for both classification and regression tasks.

2. We introduce a sparse feature extraction process into DHM, which to the best of our knowledge is the first attempt to mimic traditional decision trees with pixel-difference features in deep models.

3. For the first time, we study using a deep regression tree for a multi-task problem, i.e., regressing multiple facial landmarks.

2 Related works

We list three related topics in this section to show two trends in the computer vision community. The first is the migration from hand-crafted features towards deep representation learning for divide-and-conquer models; the second is the generalization of sparse pixel-based features to sparse operations in deep convolutional neural networks.

2.1 Traditional divide-and-conquer models

Traditional decision trees or random forests [18, 1] can be naturally viewed as divide-and-conquer models, where each non-leaf node in the tree splits the input feature space and routes the input deterministically to one of its children nodes. These models employ a greedy heuristic training procedure which randomly samples a huge pool of candidate splitting functions to minimize a local loss function. The parameter sampling procedure is sub-optimal compared to using optimization techniques, which, in combination with the hand-crafted nature of the features used, limits these models’ representation learning ability. Hierarchical mixtures of experts [5] also partition the problem space in a tree-like structure using gating models and distribute inputs to each expert model with a probability. A global maximum likelihood estimation task was formulated under a generative model framework, and an EM algorithm was proposed to optimize linear gating and expert models. This work inspires our methodology, but deep representation learning and probabilistic pruning were not studied at that time.

2.2 Deep decision/regression tree

[8, 19] proposed to extract deep features to divide the problem space and use simple probabilistic distributions at leaf nodes. These models gave traditional decision/regression trees deep representation learning ability. Leaf node update rules were proposed based on convex optimization techniques, and the resulting models outperformed deep models without a divide-and-conquer strategy. However, since the last layer of a deep model was used to divide the problem space, every path in the tree needs to be computed. Even when a branch of computation contributes little to the final prediction, it still needs evaluation because each splitting node requires the full forward pass of the deep neural network. A model structure where each splitting node is separately evaluated was used in [17] for depth estimation, but a general framework was missing and the effect of computation path pruning was not investigated.

2.3 Sparse feature extraction

The pixel-difference feature is a special type of hand-crafted feature where only several pixels from an input are considered during its evaluation. Such features are thus efficient to compute and have succeeded in computer vision tasks such as face detection [10], face alignment [14, 7, 3, 23, 4], pose estimation [20, 22] and body part classification [15]. These features were also naturally incorporated into decision/regression trees to divide the input feature space. A counterpart of the sparse feature extraction process in CNNs is sparse convolution, where the few non-zero entries in the convolution kernel determine the feature extraction process. To obtain a sparse convolution kernel, sparse decomposition [11] and pruning [13] techniques were proposed to sparsify a pre-trained dense CNN. [6] proposed an alternative where random sparse kernels are initialized before the training process. While these works focus on speeding up CNNs, there has been no study on using sparse convolutional layers in the problem-space dividing process, in the way that traditional pixel-difference features were used in decision trees.

3 Methodology

We first formulate the general DHM based on a hierarchical mixture of experts (HME) framework, then we specify the model for classification and regression experiments.

3.1 General framework of DHM

The general divide-and-conquer strategy consists of multiple levels of dividing operations and one final conquering step. The computation process is depicted as a tree where all leaf nodes are called conquering nodes while the others are named as dividing nodes. We index a node by a tuple subscript $(i,s)$ where $s$ denotes the vertical stage depth (see Figure 1) and $i$ denotes the horizontal index of the node. Every node has a non-negative integer number of children nodes, which forms a sequence $\mathcal{K}_{i,s}=\{\mathcal{K}_{i,s}^{1},\mathcal{K}_{i,s}^{2},...,\mathcal{K}_{i,s}^{|\mathcal{K}_{i,s}|}\}$ . Each node has exactly one input $\mathcal{I}_{i,s}$ and one output $\mathcal{O}_{i,s}$ .

A dividing node $\mathcal{D}_{i,s}$ is composed of a tuple of functions $(\mathcal{R}_{i,s},\mathcal{M}_{i,s})$ . The first function is called the recommendation function which judges the node input and gives the recommendation score vector $\mathbf{s}_{i,s}=\mathcal{R}_{i,s}(\mathcal{I}_{i,s})$ whose length equals the children sequence length $|\mathcal{K}_{i,s}|$ and the $j$ th entry $\mathbf{s}_{i,s}(j)$ is a real number associated with the $j$ th child node. We require

$$0\leq\mathbf{s}_{i,s}(j)\leq 1,\qquad \sum_{j=1}^{|\mathcal{K}_{i,s}|}\mathbf{s}_{i,s}(j)=1 \tag{1}$$

so that $\mathbf{s}_{i,s}(j)$ can be considered as the significance or probability of recommending the input $\mathcal{I}_{i,s}$ to the $j$ th child node. The second function $\mathcal{M}_{i,s}$ is called the mapping function and maps the input to form the output of the node $\mathcal{O}_{i,s}=\mathcal{M}_{i,s}(\mathcal{I}_{i,s})$, which is allowed to be copied and sent to all its children nodes $\mathcal{K}_{i,s}$.

We name the unique path from the root node to one conquering (leaf) node a computation path $\mathcal{P}_{i,s}$ . Each conquering node only stores one function $\mathcal{M}_{i,s}$ that maps its input into a prediction vector $\mathbf{p}_{i,s}=\mathcal{M}_{i,s}(\mathcal{I}_{i,s})$ , which is considered the termination of its computation path. To get the final prediction $\mathbf{P}$ , each conquering node contributes its output weighted by the probability of taking its computation path as

$$\mathbf{P}=\sum_{(i,s)\in\mathcal{N}_{c}}w_{i,s}\,\mathbf{p}_{i,s} \tag{2}$$

where $\mathcal{N}_{c}$ is the set of all conquering nodes. The weight can be obtained by multiplying all the recommendation scores along the path given by each dividing node. Assume the path $\mathcal{P}_{i,s}$ consists of a sequence of $s$ dividing nodes and one conquering node as $\{\mathcal{D}_{i_{1},s_{1}}^{j_{1}},\mathcal{D}_{i_{2},s_{2}}^{j_{2}},\ldots,{\mathcal{C}_{i,s}}\}$, where the superscript for a dividing node denotes which child node to choose. Then the weight can be expressed as

$$w_{i,s}=\prod_{m=1}^{s}\mathbf{s}_{i_{m},s_{m}}(j_{m}) \tag{3}$$

Note that the weights of all conquering nodes sum to 1 due to (1) and the final prediction is hence a convex combination of all the outputs of conquering nodes. In addition, we assume every function mentioned above is a differentiable function parametrized by $\boldsymbol{\uptheta}_{i,s}^{\mathcal{R}}$ or $\boldsymbol{\uptheta}_{i,s}^{\mathcal{M}}$ for recommendation or mapping function at node $(i,s)$ . Thus the final prediction is a differentiable function with respect to all the parameters which we omit above to ensure clarity. A loss function defined upon the final prediction can hence be optimized with back-propagation algorithm and benefit from some frameworks that provide automatic differentiation.
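To make the convex-combination property concrete, the following is a minimal sketch for the special case of a full binary tree. The heap-order node indexing, the function names `leaf_weights` and `combine`, and the toy scores are illustrative assumptions, not part of the paper's implementation.

```python
def leaf_weights(left_scores, depth):
    """Return the path weight of every leaf of a full binary tree.

    left_scores[i] is the (assumed) probability that dividing node i,
    stored in heap order (children of i are 2*i+1 and 2*i+2), recommends
    its input to the left child; the right child gets 1 - left_scores[i].
    """
    weights = []
    for leaf in range(2 ** depth):
        node, weight = 0, 1.0
        for level in range(depth):
            # The bits of the leaf index, read from the top, spell its path.
            go_right = (leaf >> (depth - 1 - level)) & 1
            score = left_scores[node]
            weight *= (1.0 - score) if go_right else score
            node = 2 * node + 1 + go_right
        weights.append(weight)
    return weights


def combine(weights, leaf_outputs):
    """Weighted sum of the leaf prediction vectors."""
    dim = len(leaf_outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, leaf_outputs))
            for k in range(dim)]


scores = [0.9, 0.2, 0.7]  # toy recommendation scores for a depth-2 tree
leaves = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
weights = leaf_weights(scores, depth=2)
prediction = combine(weights, leaves)
print(weights, sum(weights), prediction)
```

Because the two scores at each dividing node sum to 1, the leaf weights sum to 1 for any choice of scores, so the prediction stays inside the convex hull of the leaf outputs.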

A flexible feature of this framework is that the recommendation functions $\mathcal{R}_{i,s}$ are in general not coupled with each other. [8, 19] pass the last fully-connected layer to sigmoid gates, whose results are used as recommendation scores in the dividing nodes (Figure 2 left). In this way all recommendation functions are evaluated simultaneously to give probabilities of taking all computation paths, even when most of the paths contribute little to the final results. In contrast, our framework allows separation of the recommendation functions (Figure 2 right) so that we can avoid evaluating unnecessary computation paths.

We define a Probabilistic Pruning (PP) strategy based on the separability of the recommendation functions. Starting from the root dividing node, a child node is not visited if its recommendation score is lower than a pruning threshold $\mathcal{P}_{th}$. This process is applied recursively to the descendant dividing nodes, so that only the more important computation paths are preserved. The process is depicted in Algorithm 1.
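Algorithm 1 is not reproduced in this text, so the following is only a sketch of how such a pruned traversal could look for a full binary tree. The heap-order indexing, the callback `recommend(node, x)`, the renormalization by the surviving weight, and the toy recommendation function are all assumptions made for illustration.

```python
import math


def pruned_predict(x, recommend, leaf_outputs, depth, threshold=0.5):
    """Evaluate only the sub-trees whose recommendation score >= threshold.

    recommend(node, x) is a hypothetical callback returning the probability
    of going left at dividing node `node` (heap order). Returns the
    prediction and the number of dividing nodes that were evaluated.
    """
    if threshold > 0.5:
        raise ValueError("a threshold above 0.5 could prune every path")
    n_dividing = 2 ** depth - 1
    evaluated = 0
    total = 0.0
    acc = [0.0] * len(leaf_outputs[0])
    stack = [(0, 1.0)]  # (node index, path weight so far)
    while stack:
        node, weight = stack.pop()
        if node >= n_dividing:  # conquering (leaf) node
            out = leaf_outputs[node - n_dividing]
            acc = [a + weight * o for a, o in zip(acc, out)]
            total += weight
            continue
        left = recommend(node, x)  # evaluated lazily, only when visited
        evaluated += 1
        for child, score in ((2 * node + 1, left), (2 * node + 2, 1.0 - left)):
            if score >= threshold:
                stack.append((child, weight * score))
    # Assumption: renormalize by the weight of the surviving paths.
    return [a / total for a in acc], evaluated


def toy_recommend(node, x):
    """Stand-in for a learned recommendation function."""
    return 1.0 / (1.0 + math.exp(-(x - node)))


leaf_outputs = [[float(k)] for k in range(8)]  # depth-3 tree, 8 leaves
pruned, n_pruned = pruned_predict(0.3, toy_recommend, leaf_outputs, depth=3)
full, n_full = pruned_predict(0.3, toy_recommend, leaf_outputs, depth=3,
                              threshold=0.0)
print(pruned, n_pruned, full, n_full)
```

With a threshold of 0.5 and no exact ties, exactly one child survives at each dividing node, so the number of evaluated recommendation functions equals the tree depth instead of the number of dividing nodes.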

We call this model with decoupled recommendation functions the Deep Hierarchical Machine (DHM). The Deep Sparse Hierarchical Machine (DSHM) is a specific form of DHM in which a sparse feature extractor is used inside a node. For instance, $\mathcal{R}_{i,s}(\mathcal{I}_{i,s})=\mathcal{R}_{i,s}(\mathcal{G}(\mathcal{I}_{i,s}))$ where $\mathcal{G}$ only considers a small portion of the input $\mathcal{I}_{i,s}$.

3.2 Classification

For classification problems, the output $\mathbf{p}_{i,s}$ of each conquering node $\mathcal{C}_{i,s}$ is a discrete probability distribution vector whose length equals the number of classes. The $y$ th entry $\mathbf{p}_{i,s}(y)$ gives the probability $\mathbb{P}(y|\mathcal{I}_{0,0})$ that the root node input $\mathcal{I}_{0,0}$ belongs to class $y$.

To train the model, we adopt the probabilistic generative model formulation [5] which leads to a maximum likelihood optimization problem. For one training instance which is composed of an input vector and a class label $\{\mathbf{x}_{i},y_{i}\}$ , the likelihood of generating it is,

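$$\mathbb{P}(y_{i}|\mathbf{x}_{i})=\sum_{(i,s)\in\mathcal{N}_{c}}\prod_{m=1}^{s}\mathbf{s}_{i_{m},s_{m}}(j_{m})\,\mathbf{p}_{i,s}(y_{i}) \tag{4}$$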

The optimization target is to minimize the negative log-likelihood loss over the whole training set containing $N$ instances $\mathbb{D}=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{N}$ ,

$$L(\mathbb{D})=-\sum_{i=1}^{N}\log\big(\mathbb{P}(y_{i}|\mathbf{x}_{i})\big) \tag{5}$$

In this study, we constrain each dividing node to have exactly two children since we do not assume any prior knowledge of how many parts the input feature space should be split into. We also assume a full binary-tree structure for simplicity. If some application-specific information such as clustering results is available, the tree structure can be adjusted accordingly. In the case of a full binary tree, we can index each node with a single non-negative integer $i$ for convenience. The recommendation function in each dividing node only needs to give a 2-vector $\mathbf{s}_{i}$, and we use the shorthand $s_{i}$ to denote the probability that the current dividing node input $\mathcal{I}_{i}$ is recommended to the left sub-tree. For a dividing node $\mathcal{D}_{i}$, we denote nodes in its left and right sub-trees as node sets ${\mathcal{D}_{i}^{l}}$ and ${\mathcal{D}_{i}^{r}}$, respectively. Then the probability of recommending the input $\mathbf{x}$ to a conquering node $\mathcal{C}_{i}$ can be expressed as,

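$$\mathbb{P}(\mathcal{C}_{i}|\mathbf{x})=\prod_{\mathcal{D}_{j}\in\mathcal{N}_{d}}s_{j}^{\mathbbm{1}(\mathcal{C}_{i}\in\mathcal{D}_{j}^{l})}(1-s_{j})^{\mathbbm{1}(\mathcal{C}_{i}\in\mathcal{D}_{j}^{r})} \tag{6}$$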

where $\mathcal{N}_{d}$ is the set of all dividing nodes and $\mathbbm{1}$ is an indicator variable for the expression inside the parenthesis to hold. For the classification experiments we use the simplest conquering strategy for each conquering node as in [8], where each conquering node gives a constant probability distribution $\mathbf{p}_{i}$ . The loss function is differentiable with respect to each $s_{i}$ , and the gradient for this full binary tree structure $\frac{\partial L(\mathbb{D})}{\partial{s_{i}}}$ is [8, 19, 17],

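$$\frac{\partial L(\mathbb{D})}{\partial s_{i}}=-\sum_{t=1}^{N}\left(\frac{\sum_{\mathcal{C}_{j}\in\mathcal{D}_{i}^{l}}\mathbf{p}_{j}(y_{t})\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{t})}{s_{i}\,\mathbb{P}(y_{t}|\mathbf{x}_{t})}-\frac{\sum_{\mathcal{C}_{j}\in\mathcal{D}_{i}^{r}}\mathbf{p}_{j}(y_{t})\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{t})}{(1-s_{i})\,\mathbb{P}(y_{t}|\mathbf{x}_{t})}\right) \tag{7}$$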

This gradient can be passed backward into each dividing node to train its function parameters. Note that in our framework the dividing nodes $\mathcal{D}_{i}$ are generally decoupled from each other, while in [8] and [19] all $\mathcal{D}_{i}$ come from the last layer of a deep model and are hence coupled. When the dividing nodes are fixed, the distribution at each conquering node can be updated iteratively [8],

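$$\mathbf{p}_{j}^{t+1}(y)=\frac{1}{Q_{j}^{t}}\sum_{i=1}^{N}\frac{\mathbbm{1}(y_{i}=y)\,\mathbf{p}_{j}^{t}(y_{i})\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{i})}{\mathbb{P}(y_{i}|\mathbf{x}_{i})} \tag{8}$$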

where $Q_{j}^{t}$ is a normalization factor to ensure $\sum_{y=1}^{|\mathbf{p}_{j}|}\mathbf{p}_{j}^{t+1}(y)=1$ . The backward propagation and the conquering nodes update are carried out alternately to train the model.
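The two alternating steps can be illustrated with a small sketch on a depth-2 tree. It treats the recommendation scores of each sample as given numbers, computes the per-sample gradient of the negative log-likelihood with respect to one score, compares it against a central finite difference, and then applies one leaf distribution update. The heap-order indexing, all function names, and the toy data are assumptions for illustration. Note that, because the loss is a negative log-likelihood, the gradient carries an overall minus sign in front of the left-minus-right term.

```python
import math

DEPTH = 2
N_LEAF = 2 ** DEPTH


def path(leaf):
    """(dividing node, went_right) pairs from the root to `leaf` (heap order)."""
    node, steps = 0, []
    for level in range(DEPTH):
        right = (leaf >> (DEPTH - 1 - level)) & 1
        steps.append((node, right))
        node = 2 * node + 1 + right
    return steps


def routing(s):
    """Probability of reaching each leaf: product of s or 1 - s along its path."""
    return [math.prod((1.0 - s[n]) if r else s[n] for n, r in path(leaf))
            for leaf in range(N_LEAF)]


def likelihood(s, y, pi):
    """P(y | x) as a routing-weighted mixture of the leaf distributions."""
    return sum(m * p[y] for m, p in zip(routing(s), pi))


def grad_nll(s, y, pi, i):
    """d(-log P(y | x)) / d s_i for a single sample."""
    mu, p_y = routing(s), likelihood(s, y, pi)
    left = sum(pi[j][y] * mu[j] for j in range(N_LEAF) if (i, 0) in path(j))
    right = sum(pi[j][y] * mu[j] for j in range(N_LEAF) if (i, 1) in path(j))
    return -(left / (s[i] * p_y) - right / ((1.0 - s[i]) * p_y))


def update_leaves(data, pi):
    """One leaf distribution update with the dividing nodes held fixed."""
    new_pi = []
    for j in range(N_LEAF):
        counts = [0.0] * len(pi[0])
        for s, y in data:
            counts[y] += pi[j][y] * routing(s)[j] / likelihood(s, y, pi)
        norm = sum(counts)  # normalization so the entries sum to 1
        new_pi.append([c / norm for c in counts])
    return new_pi


def nll(data, pi):
    return -sum(math.log(likelihood(s, y, pi)) for s, y in data)


# Toy data: (recommendation scores of nodes 0..2, class label) per sample.
data = [([0.9, 0.2, 0.7], 1), ([0.3, 0.6, 0.4], 0), ([0.5, 0.5, 0.8], 1)]
pi = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]

s0, y0 = data[0]
eps = 1e-6
for i in range(3):
    up = s0[:i] + [s0[i] + eps] + s0[i + 1:]
    down = s0[:i] + [s0[i] - eps] + s0[i + 1:]
    numeric = (math.log(likelihood(down, y0, pi))
               - math.log(likelihood(up, y0, pi))) / (2 * eps)
    print(i, grad_nll(s0, y0, pi, i), numeric)

new_pi = update_leaves(data, pi)
print(nll(data, pi), nll(data, new_pi))
```

In a full model the scores would be produced by the dividing-node networks, and the analytic gradient would be propagated into their parameters by back-propagation; here only the last step of that chain is checked.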

3.3 Regression

For regression problems, the output of a conquering node $\mathcal{C}_{i,s}$ is also a real-valued vector $\mathbf{p}_{i,s}$ but the entries do not necessarily sum to 1. The final prediction vector $\mathbf{P}_{i}$ for input $\mathbf{x}_{i}$ is,

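$$\mathbf{P}_{i}=\sum_{(i,s)\in\mathcal{N}_{c}}\prod_{m=1}^{s}\mathbf{s}_{i_{m},s_{m}}(j_{m})\,\mathbf{p}_{i,s} \tag{9}$$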

For a multi-task regression dataset with $N$ instances $\mathbb{D}=\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^{N}$ , we directly use the squared loss function,

$$L(\mathbb{D})=\frac{1}{2}\sum_{i=1}^{N}\|\mathbf{P}_{i}-\mathbf{y}_{i}\|^{2} \tag{10}$$

which was also used in the mixture of experts framework [12]. Here we use the same full binary tree structure and assume simple conquering nodes which have constant mapping functions just as the classification case. Similarly, $\frac{\partial L(\mathbb{D})}{\partial{s_{i}}}$ is computed as,

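$$\frac{\partial L(\mathbb{D})}{\partial s_{i}}=\sum_{t=1}^{N}(\mathbf{P}_{t}-\mathbf{y}_{t})^{T}\left(\frac{\mathbf{A}_{l}}{s_{i}}-\frac{\mathbf{A}_{r}}{1-s_{i}}\right) \tag{11}$$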

where $\mathbf{A}_{l}=\sum_{\mathcal{C}_{j}\in\mathcal{D}_{i}^{l}}\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{t})\mathbf{p}_{j}$ and $\mathbf{A}_{r}=\sum_{\mathcal{C}_{j}\in\mathcal{D}_{i}^{r}}\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{t})\mathbf{p}_{j}$. Similar to (8), we update the conquering node prediction as

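$$\mathbf{p}_{j}^{t+1}=\frac{\sum_{i=1}^{N}\mathbf{y}_{i}\,\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{i})}{\sum_{i=1}^{N}\mathbb{P}(\mathcal{C}_{j}|\mathbf{x}_{i})} \tag{12}$$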

This update rule is inspired by traditional regression trees, which compute an average of the target vectors that are routed to a leaf node. Here each target vector is weighted by how likely its input is to be recommended to this conquering node.
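The weighted-average update can be sketched as follows. The routing probabilities of each sample are taken as given numbers, and the function names and toy data are assumptions for illustration. With hard (0 or 1) routing the update reduces to the plain per-leaf mean of a traditional regression tree, which the example uses as a sanity check.

```python
def predict(mu_i, leaf_vectors):
    """Final prediction: routing-weighted sum of the leaf vectors."""
    dim = len(leaf_vectors[0])
    return [sum(m * p[k] for m, p in zip(mu_i, leaf_vectors))
            for k in range(dim)]


def squared_loss(mu, targets, leaf_vectors):
    """Half the summed squared error over the data set."""
    total = 0.0
    for mu_i, y_i in zip(mu, targets):
        pred = predict(mu_i, leaf_vectors)
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(pred, y_i))
    return total


def update_leaf_vectors(mu, targets, n_leaves):
    """Each leaf vector becomes the routing-weighted mean of the targets."""
    dim = len(targets[0])
    new_vectors = []
    for j in range(n_leaves):
        denom = sum(mu_i[j] for mu_i in mu)
        new_vectors.append(
            [sum(mu_i[j] * y_i[k] for mu_i, y_i in zip(mu, targets)) / denom
             for k in range(dim)])
    return new_vectors


targets = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]

# Hard routing: samples 0 and 1 reach leaf 0, sample 2 reaches leaf 1.
hard_mu = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
hard_leaves = update_leaf_vectors(hard_mu, targets, n_leaves=2)
print(hard_leaves, squared_loss(hard_mu, targets, hard_leaves))

# Soft routing: every sample contributes to every leaf with some weight.
soft_mu = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
soft_leaves = update_leaf_vectors(soft_mu, targets, n_leaves=2)
print(soft_leaves, squared_loss(soft_mu, targets, soft_leaves))
```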

4 Experiments

4.1 Classification for MNIST

We start with an illustration using MNIST. We compare the model architecture of [8, 19] with two variants of our proposed DHM as shown in Figure 3. The original architecture [8, 19] is denoted as NDF. NDF passes some randomly chosen outputs from the last fully-connected layer to sigmoid gates, whose outputs are used as the recommendation scores $s_{i}$ of each dividing node. The other two structures are detailed in the following subsections.

The MNIST data set contains 60000 training images and 10000 testing images of size 28 by 28 (see https://pytorch.org/docs/0.4.0/_modules/torchvision/datasets/mnist.html). During the experiment, the binary tree depth and the number of trees are set to 7 and 1, respectively. The Adam optimizer is used with a learning rate of 0.001. The batch size is set to 500 and training is fixed to 50 epochs. Every experiment is repeated 10 times, and averaged results with standard deviations are reported.

4.1.1 Separated Recommendation Functions

This type of DHM separates each dividing node’s input and output, as shown in the middle column of Figure 3. Each dividing node processes the raw input image and produces a single number after the fully-connected layer, which is passed through a sigmoid function to give $s_{i}$. One can think of this structure as one in which the mapping functions of all dividing nodes are identity mappings $\mathcal{M}_{i,s}(\mathcal{I}_{i,s})=\mathcal{I}_{i,s}$. We denote this type as DHM (separated). The final test accuracy of this and other types of models is summarized in Table 1. In addition, we estimate the computational load by the number of multiplication (NOM) operations needed in the convolution and linear layers, which is shown in Table 2.

4.1.2 Deeper Feature Along the Path

In this type of architecture, the root dividing node does more initial processing and reduces the size of the input images (see the right column of Figure 3). Other dividing nodes pass the processed feature maps to their children dividing nodes as inputs. Every dividing node also sends its flattened output to a linear and sigmoid layer to produce $s_{i}$. The mapping function in this case can be seen as the local network without the last fully-connected layer. The intuition behind this topology is that a node input at larger depth has passed through more dividing nodes and has been processed more times. This type of model is denoted as DHM (connected).

4.1.3 Probabilistic Pruning

The distribution of $s_{i}$ during the training process is shown in Figure 4. Every bar plot contains 500 bins to quantize all dividing nodes’ $s_{i}$ values from the 60000 training images. After initialization the distribution is centered around 0.5, while after longer training the dividing nodes become more decisive in recommending their inputs. When $s_{i}$ is very close to 1 or 0, the contribution from one of the two sub-trees is too low to be worth the extra evaluation. This motivates the Probabilistic Pruning (PP) strategy, which dynamically gives up the evaluation of a sub-tree if the recommendation score of entering it is too low. NDF does not support PP even if the distribution strongly encourages it (see Figure 4 left), since all dividing nodes are coupled to the last fully-connected layer of the network. DHM, on the other hand, supports PP naturally. In the experiment, we set the pruning threshold to 0.5 so that only one computation path is taken for every input image. The resulting test accuracy and NOM are shown in Table 1 and Table 2, respectively. Applying PP reduces the test accuracy only negligibly, while the number of evaluated dividing nodes is reduced from exponential to linear in the tree depth, since now only the most significant computation path determines the result. These results show that DHM can take advantage of the distribution of recommendation scores.
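The exponential-to-linear reduction can be checked against Table 2 with a back-of-the-envelope calculation. The sketch below assumes that a depth of 7 means seven levels of dividing nodes, that every dividing node of DHM (separated) costs the same number of multiplications, and that the cost of the conquering nodes is negligible; none of these assumptions is stated explicitly in the text.

```python
def dividing_nodes_full(depth):
    """Dividing nodes evaluated without pruning in a full binary tree."""
    return 2 ** depth - 1


def dividing_nodes_single_path(depth):
    """Dividing nodes evaluated when pruning keeps a single path."""
    return depth


depth = 7
full = dividing_nodes_full(depth)
single = dividing_nodes_single_path(depth)

nom_full = 20.9e6  # DHM (separated) before pruning, from Table 2
estimate = nom_full / full * single  # assumes equal cost per dividing node
print(full, single, round(estimate / 1e6, 2))  # prints: 127 7 1.15
```

Under these assumptions the estimate of about 1.15M multiplications is close to the 1.14M reported for DHM (separated) after pruning. The same reasoning does not transfer directly to DHM (connected), where the root node does more work than the others.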

The distribution of recommendation scores for testing images before and after pruning is shown in Figure 5. Surprisingly, when a large number of “hesitating” dividing nodes are forced to choose a single child node deterministically, the accuracy is not affected significantly.

4.1.4 Adding Sparsity

Here we use local binary convolution [6] to add a sparse feature extraction process to DHM, making it DSHM. Every original convolution layer is replaced by two convolution layers and a ReLU gate. The first convolution layer is fixed and does not introduce any learnable parameters. The output feature maps of the first layer are passed to the ReLU gate, whose outputs are linearly combined by the second, 1 by 1 convolution layer. During initialization, some entries in the convolution kernel of the first layer are randomly set to zero. The remaining entries are randomly assigned 1 or -1 with probability 0.5 each. The fraction of non-zero entries in the fixed convolution kernel is defined as the sparsity level. In the experiment, we use 16 intermediate channels (the number of output feature maps of the first layer) for all local binary convolution layers. DHM (separated) is used, and the other network parameters are consistent with the former experiments without sparse convolution layers.
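As an illustration of the layer described above, here is a pure-Python sketch for a single-channel image. It is not the authors' implementation: the function names are made up, the number of non-zero entries is fixed to `round(sparsity * size * size)` rather than sampled per entry, and the 1 by 1 convolution is represented by a plain list of mixing weights.

```python
import random


def make_sparse_binary_kernel(size, sparsity, rng):
    """Fixed kernel with entries in {-1, 0, +1}.

    Assumption: exactly round(sparsity * size * size) entries are non-zero,
    and each of them is +1 or -1 with probability 0.5.
    """
    n = size * size
    nonzero = set(rng.sample(range(n), round(sparsity * n)))
    flat = [rng.choice((-1, 1)) if k in nonzero else 0 for k in range(n)]
    return [flat[r * size:(r + 1) * size] for r in range(size)]


def sparse_conv(image, kernel):
    """Valid convolution that touches only the non-zero kernel entries."""
    size = len(kernel)
    taps = [(r, c, v) for r, row in enumerate(kernel)
            for c, v in enumerate(row) if v != 0]
    out = []
    for i in range(len(image) - size + 1):
        row = []
        for j in range(len(image[0]) - size + 1):
            acc = 0.0
            for r, c, v in taps:
                # Binary weights need only additions and subtractions.
                if v > 0:
                    acc += image[i + r][j + c]
                else:
                    acc -= image[i + r][j + c]
            row.append(acc)
        out.append(row)
    return out


def lbc_layer(image, kernels, mix):
    """Fixed sparse convolutions, ReLU, then a learnable 1x1 combination."""
    maps = [[[max(0.0, v) for v in row] for row in sparse_conv(image, k)]
            for k in kernels]
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m * fmap[i][j] for m, fmap in zip(mix, maps))
             for j in range(width)] for i in range(height)]


rng = random.Random(0)
kernels = [make_sparse_binary_kernel(3, 0.3, rng) for _ in range(16)]
mix = [rng.uniform(-1.0, 1.0) for _ in kernels]  # stand-in for learned weights
image = [[rng.random() for _ in range(6)] for _ in range(6)]
output = lbc_layer(image, kernels, mix)
print(len(output), len(output[0]))
```

Only the mixing weights would be trained in this sketch; the sparse kernels stay fixed after initialization, so the multiplications are confined to the 1 by 1 combination.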

The resulting test accuracy and NOM are shown in Table 3. Since convolution with a binary kernel can be implemented by addition and subtraction, the required NOM is further reduced. This experiment shows that a sparse feature extraction process can be seamlessly incorporated into DHM, which makes it suitable for devices with limited computational resources.

4.2 Cascaded regression with DHM

Here we compare DHM with the NDF architecture on a regression task, i.e., cascaded-regression-based face alignment. For an input image $\mathbf{x}_{i}$, the goal of face alignment is to predict the facial landmark position vector $\mathbf{y}_{i}$. The cascaded regression method starts with an initialized shape $\hat{\mathbf{y}}_{0}$ and uses a cascade of regressors to update the estimated facial shape stage by stage. The final prediction is $\hat{\mathbf{y}}=\hat{\mathbf{y}}_{0}+\sum_{t=1}^{K}\Delta\mathbf{y}_{t}$, where $K$ is the total number of stages and $\Delta\mathbf{y}_{t}$ is the shape update at stage $t$. In [7, 23], every regressor was an ensemble of regression trees whose leaf nodes give the shape update vector. Every splitting node in the regression tree locates two pixels around one currently estimated landmark, whose difference is used to route the input in a hard-splitting manner (see Figure 6).
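The stage-by-stage update can be written as a short loop. In the sketch below each stage regressor is a hypothetical callable that receives the image and the current shape estimate and returns a shape update; the toy regressor that moves halfway towards a known target is only there to make the example runnable.

```python
def cascaded_regression(image, initial_shape, regressors):
    """Apply the stage regressors in order, accumulating shape updates."""
    shape = list(initial_shape)
    for regress in regressors:
        update = regress(image, shape)  # shape update of this stage
        shape = [s + d for s, d in zip(shape, update)]
    return shape


target = [4.0, 8.0]  # toy ground-truth shape, used only by the toy regressor


def toy_stage(image, shape):
    """Stand-in regressor: move halfway from the current shape to the target."""
    return [0.5 * (t - s) for t, s in zip(target, shape)]


final = cascaded_regression(None, [0.0, 0.0], [toy_stage] * 3)
print(final)  # prints: [3.5, 7.0]
```

Each stage sees the shape produced by the previous one, which is what lets later stages crop features around progressively better landmark estimates.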

We replace the traditional regression trees with our DHMs, which use a full binary tree structure, so as to extend [7] with deep representation learning ability. During initialization, every dividing node is randomly assigned a landmark index. The input to a dividing node is then a cropped region centered around its indexed landmark. In the experiment we use a crop size of 60 by 60 and a simple CNN to compute the recommendation score, whose structure is shown in Figure 6. The comparison group uses the traditional NDF architecture, to which we feed the entire image as input (see Figure 7). Similarly, every conquering node stores a shape update vector as in [7, 23] and (12) is used to update them.

We use a large-scale synthetic 3D face alignment dataset, 300W-LP [25], for training and the AFLW3D dataset (re-annotated by [2]) for testing. We use 57559 training images in 300W-LP and all 1998 images in AFLW3D for testing. The images are cropped and resized to 224 by 224 patches using the same initial processing procedure as in [2]. To rule out the influence of face detectors as mentioned in [16], the bounding box of a face is assumed to be centered at the centroid of the facial landmarks and to enclose all of them. We use the same error metric as [2], where the landmark prediction error is normalized by the bounding box size. In the experiment we use a cascade length of 10 and a tree depth of 5, and in each stage we use an ensemble of 5 DHMs. We use the Adam optimizer with a learning rate of 0.01 (0.001 for NDF, as it works better for it in the experiment) and train for 10 epochs in each stage. The average test errors of the two architectures are shown in Table 4. Again, DHM supports PP, which greatly reduces the computational cost (from 228M to 35.6M multiplications) while the error increases only slightly (from 0.0628 to 0.06382). This experiment again supports the strength of DHM over the traditional NDF architecture in regression problems. Figure 8 shows some success and failure cases of this model. Compared with NDF, DHM after pruning significantly reduces the computational cost while retaining slightly better alignment accuracy.

5 Conclusion

We proposed Deep Hierarchical Machine (DHM), a flexible framework for combining the divide-and-conquer strategy with deep representation learning. Unlike recently proposed deep neural decision/regression forests, DHM can take advantage of the distribution of recommendation scores, and a probabilistic pruning strategy is proposed to avoid unnecessary path evaluation. We also showed the feasibility of introducing a sparse feature extraction process into DHM by using local binary convolution, which mimics traditional decision trees with pixel-difference features and has potential for devices with limited computing resources.

Bibliography

[1] Decision Forests for Computer Vision and Medical Image Analysis. Advances in Computer Vision and Pattern Recognition. Springer London, London, 2013.
[2] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1021–1030, Oct 2017.
[3] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2887–2894, June 2012.
[4] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 109–122, Cham, 2014. Springer International Publishing.
[5] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[6] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Local binary convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4284–4293, July 2017.
[7] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, June 2014.
[8] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. Deep neural decision forests. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1467–1475, Dec 2015.