Towards Real-Time Head Pose Estimation: Exploring Parameter-Reduced Residual Networks on In-the-wild Datasets
Ines Rieger, Thomas Hauenstein, Sebastian Hettenkofer, and Jens-Uwe Garbas

TL;DR
This paper develops parameter-reduced Residual Networks for real-time head pose estimation, maintaining accuracy while enabling fast inference on in-the-wild datasets for human-computer interaction applications.
Contribution
It introduces modified ResNets with fewer parameters that achieve state-of-the-art accuracy and real-time performance in head pose estimation.
Findings
Reduced ResNets maintain accuracy comparable to larger models.
Modified ResNets achieve real-time inference speeds.
Models trained on in-the-wild datasets demonstrate robustness in real-world scenarios.
Table 1: Modified ResNet models.

| ResNet Model | Input Size | Stacks | Layers | Parameters |
|---|---|---|---|---|
| ResNet34-112 | 112 × 112 pixels | [3,4,6,3] | 34 | 21.27 × 10^6 |
| ResNet18-112 | 112 × 112 pixels | [2,2,2,2] | 18 | 11.17 × 10^6 |
| ResNet18-64 | 64 × 64 pixels | [2,3,3,0] | 18 | 4.25 × 10^6 |

Table 2: Average results of the modified ResNets on the AFLW dataset and, for the yaw angle, on the AFW dataset.

| Angle | MAE | Std. Dev. | Accuracy (same category) | Accuracy (same or adjoining category) |
|---|---|---|---|---|
| ResNet34-112 (CPU: 8 fps, GPU: 100 fps, 21.27 × 10^6 parameters) | | | | |
| Yaw | | | 56.0% | 93.8% |
| Pitch | | | 61.8% | 97.8% |
| Roll | | | 78.5% | 99.8% |
| Yaw (AFW) | | | 38.0% | 77.9% |
| ResNet18-112 (CPU: 17 fps, GPU: 142 fps, 11.17 × 10^6 parameters) | | | | |
| Yaw | | | 54.0% | 93.3% |
| Pitch | | | 62.4% | 98.1% |
| Roll | | | 78.4% | 99.8% |
| Yaw (AFW) | | | 42.6% | 83.8% |
| ResNet18-64 (CPU: 50 fps, GPU: 250 fps, 4.25 × 10^6 parameters) | | | | |
| Yaw | | | 53.4% | 93.4% |
| Pitch | | | 59.6% | 97.7% |
| Roll | | | 77.8% | 99.7% |
| Yaw (AFW) | | | 41.1% | 83.3% |
Fraunhofer-Institute for Integrated Circuits IIS,
Am Wolfsmantel 33, 91058 Erlangen, Germany
[email protected]
Abstract
Head poses are a key component of human bodily communication and thus a decisive element of human-computer interaction. Real-time head pose estimation is crucial in the context of human-robot interaction or driver assistance systems. The most promising approaches for head pose estimation are based on Convolutional Neural Networks (CNNs). However, CNN models are often too complex to achieve real-time performance. To face this challenge, we explore a popular subgroup of CNNs, the Residual Networks (ResNets) and modify them in order to reduce their number of parameters. The ResNets are modified for different image sizes including low-resolution images and combined with a varying number of layers. They are trained on in-the-wild datasets to ensure real-world applicability. As a result, we demonstrate that the performance of the ResNets can be maintained while reducing the number of parameters. The modified ResNets achieve state-of-the-art accuracy and provide fast inference for real-time applicability.
Keywords: Head pose estimation · Residual Network · Real-time
1 Introduction
Head poses are a key aspect of human non-verbal communication. As a consequence, automatic head pose estimation plays an important role in human-computer interaction. Several real-world use cases rely on head pose estimation: In autonomous driving, head poses are used to estimate the driver’s level of attention. Inattentive drivers can then be encouraged to focus on the road again [16]. To reduce the risk of collisions, the attention level of surrounding pedestrians can also be estimated from their head poses [1, 9]. Head pose estimation is further a key aspect of real-time human-robot interaction, e.g. in domestic environments [26], where it provides robots with a natural mode of interaction with their users. In behavioural studies, head poses can be used to identify social groups [17] or a person’s target of interest [21]. Head poses are also part of the Facial Action Coding System (FACS) [8] to decode emotions and therefore contribute to the interpretation of facial expressions [13]. Thus, real-time head pose estimation is crucial for several real-world applications, but still faces challenges such as slow inference and limited robustness in these settings. This paper aims at providing a real-time solution by exploring small Convolutional Neural Network (CNN) models for low-resolution input images trained on in-the-wild datasets.
CNNs, a specialized kind of feed-forward neural network, have proven to be advantageous for various image and video processing tasks such as object detection or object recognition [18]. One particularly successful CNN architecture is the Residual Network (ResNet) architecture [10], which enables effective training of very deep networks through shortcut connections. The shortcut connections enclose blocks of stacked convolutional layers and provide a second path for propagating information forward and backward through the network. By investigating the gradient flow, Veit et al. [31] showed that not all blocks contribute equally to the learning process. Presumably, a shallower ResNet architecture could learn the same representations and perform as well as a deeper ResNet architecture. Based on this assumption and the overall success of ResNets, this paper contributes to the field of real-time head pose estimation by exploring various modified ResNets. We can summarize our key contributions as follows:
- We start to reduce the model parameters by adapting the original ResNet architecture for training with images of 112 × 112 pixels instead of 224 × 224 pixels. We further reduce the parameters by adapting the 18-layer ResNet for low-resolution images of 64 × 64 pixels. This 18-layer ResNet contains fewer parameters than the 18-layer ResNet originally proposed by He et al. [10].
- The modified ResNets are evaluated on two in-the-wild datasets: the Annotated Facial Landmarks in the Wild (AFLW) dataset [14] and the Annotated Faces in the Wild (AFW) benchmark dataset [34]. In-the-wild datasets ensure real-world applicability, which is important for use cases such as driver assistance systems or human-robot interaction. These datasets include no depth information.
- The performance of the implemented ResNets is evaluated with a five-fold cross-validation on the AFLW dataset and with five training-testing cycles on the AFW dataset. Multiple training cycles not only contribute to the robustness of the results, but also mitigate the non-deterministic behaviour of multi-threaded training on GPUs. The results are measured in mean absolute error and accuracy.
- We compute the number of parameters and measure the inference time on a CPU and on a GPU. Low model complexity and the corresponding fast inference time are important for real-time applications.
The ResNet models are trained to estimate the head poses represented by Euler Angles, which measure the orientation of a rigid body in a fixed coordinate system [4].
2 Related Work
Head pose estimation approaches can be grouped into appearance-based methods, model-based methods and nonlinear regression methods.
Appearance-based methods compare new head images with a set of exemplary, annotated heads and pick the most similar one [2, 24]. Despite the advantages of a simple implementation and an easy extension to new heads, there is a major disadvantage: the method is based on the premise that similar images also have similar head poses, and thus ignores the impact of identity.
In contrast to appearance-based methods, model-based methods follow a geometric approach by not taking the whole face into account, but only certain facial key-points. One approach uses the POSIT algorithm [3] to fit an averaged 3-dimensional facial model onto a 2-dimensional face image annotated with facial key-points, and then computes the head pose [14]. Another approach is to measure the distance of the facial key-points of the 2-dimensional image to a reference coordinate system [22]. The drawback of the model-based approaches is the need for high accuracy in the facial key-point detection. Estimating head poses from images with occluded face regions is therefore difficult.
To cover the complex feature space required for head pose estimation in images in in-the-wild settings, nonlinear regression methods can present a solution. The first nonlinear regression methods for head pose estimation were support vector regression [23], random forests [6, 5] and multilayer perceptrons (MLP) [28, 29]. With the rise of computational power, CNNs emerged around 2007 in the field of image-based head pose estimation. In contrast to MLPs, CNNs display a high tolerance to shift and distortion variance. There are several recent approaches that employ the in-the-wild datasets AFLW and AFW for training and use the AFW dataset as a testing benchmark: Patacchiola and Cangelosi [25] compare various LeNet-5 [19] variants trained with different gradient and adaptive gradient methods. Ruiz et al. [27] train a 50-layer ResNet with a combined loss function of mean squared loss and cross entropy loss for all three angles. They achieve good results and outperform Patacchiola and Cangelosi. Kumar et al. [15] use a Heatmap-CNN (H-CNN) that learns local and global structural dependencies for detecting facial landmarks and estimating the head pose. The H-CNN includes Inception modules [30] that consist of parallel threads of stacked convolutional layers and therefore display an architecture similar to ResNets. Hsu et al. [12] train their multi-loss CNN based on a combined L2 loss regression and ordinal regression loss. To counteract the gimbal lock [20], an ambiguity problem in the Euler angle representation, they use quaternions as head pose representation. Hsu et al. [12] find that their pretrained Quaternion Net outperforms their network using Euler Angles for head pose representation. Wu et al. [32] train their combined face detection network on an augmented AFLW dataset combined with their own unreleased head pose dataset. An evaluation on the AFW dataset achieves state-of-the-art performance. Zhang et al. [33] use a cross-cascading regression network with two submodules, one for facial landmark detection and one for head pose estimation, and achieve state-of-the-art performance on the AFLW dataset.
3 Residual Networks (ResNets)
The ResNet is one of the most popular architectures for image processing with very deep neural networks. The architecture was proposed by He et al. [10], who won benchmark competitions like the ImageNet Large Scale Visual Recognition Competition (ILSVRC) 2015 (http://image-net.org/challenges/LSVRC/2015/, accessed 14.12.2018) and the Common Objects in Context (COCO) competition (http://cocodataset.org/#detections-challenge2015, accessed 14.12.2018) with an ensemble of ResNets. They have also demonstrated successful training of networks with over one thousand layers [10].
3.1 Original ResNets
ResNets use the concept of shortcut connections, with the effect that the input of a layer contains information not only from the immediately preceding layer, but from all preceding layers. This is in contrast to hierarchical CNNs, where the input of a layer only contains information from the directly preceding layer. The shortcut connections can resolve the problem of vanishing or exploding gradients and the degradation of the training error in hierarchical CNNs. Degradation of the training error means that the training and test errors increase when the network depth grows.
A shortcut connection is called an identity shortcut when the input and output dimensions stay the same within a block. Identity shortcuts do not introduce additional parameters into the network. When the dimension increases, He et al. [10] consider two options: (1) identity mapping with extra zero entries or (2) projection shortcuts using 1 × 1 convolutions. As ResNets do not use as many filters as classic CNNs to achieve the same depth, they are parameter-reduced.
The residual block (see Fig. 1 (right side)) described by He et al. [10] uses the Rectified Linear Unit (ReLU) activation function with a preceding batch normalization layer. The shortcut connection encloses a block of two convolutional layers $F(x)$, where the input $x$ is added to the result of the two convolutional layers, resulting in $H(x)$ in the forward propagation (see Eq. 1). This equation does not hold for blocks using the projection shortcut, but since such blocks are rare, He et al. [10] do not expect this to have a great impact.
$$H(x) = F(x) + x \qquad (1)$$
In the backpropagation, the incoming gradient is split into two additive terms: the error is propagated through the shortcut connection and through the residual function, where the weights of the convolutional layers are adjusted. Since every gradient is a sum of these two terms, the gradients are unlikely to vanish.
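The additive split can be made explicit with the chain rule. The following is a sketch for a single block with an identity shortcut, written in scalar notation, where $x$ is the block input, $F$ the residual function of the two convolutional layers, $H(x) = F(x) + x$ the block output and $\mathcal{L}$ the training loss (the symbol $\mathcal{L}$ is introduced here for illustration only):

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H}\,\frac{\partial H}{\partial x} = \frac{\partial \mathcal{L}}{\partial H}\left(\frac{\partial F}{\partial x} + 1\right) = \frac{\partial \mathcal{L}}{\partial H}\,\frac{\partial F}{\partial x} + \frac{\partial \mathcal{L}}{\partial H}$$

The second term reaches $x$ through the shortcut without passing through any convolutional weights, so the gradient at $x$ does not depend solely on the derivative of $F$.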
To enhance the performance of the ResNet, He et al. [11] propose a pre-activated residual block, where the Batch Normalization layer is placed before instead of after the convolutional layer. The presented experiments in this paper use the pre-activated form of the residual block.
3.2 Modified ResNets
The challenging aspect of real-time applications is the complexity of the trained models and the resulting computing time. To address this challenge, we explore three parameter-reduced ResNets of different depths and for different image sizes.
These ResNets (see Table 1) are modified to process images with a lower resolution of 112 × 112 pixels and 64 × 64 pixels instead of 224 × 224 pixels as in the original ResNet [11]. As a first step, the stride of the first convolutional layer is changed to one, so that this layer does not reduce the size of the feature map. This differs from the original ResNet, where a stride of two is used. Consequently, the ResNet can process input images with a size of 112 × 112 pixels instead of 224 × 224 pixels. The ResNet34-112 and ResNet18-112 are modified to take 112 × 112 pixels as input. Their layers are divided into four stacks similar to [10]. The smallest ResNet proposed by He et al. [10] has 18 layers divided into four stacks. To further reduce the parameters, we propose the ResNet18-64, which uses only three stacks instead of four. This allows low-resolution inputs of 64 × 64 pixels while significantly decreasing the number of parameters.
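To illustrate how the stack layout drives the parameter count, the following sketch counts only the convolutional weights of a ResNet built from two-layer residual blocks. Several details are assumptions that the text does not state: a single-channel (greyscale) input, a 7 × 7 first convolution with 64 filters, stack widths of 64, 128, 256 and 512 filters as in He et al. [10], and a 1 × 1 projection shortcut wherever the width changes. Batch normalization and output-layer parameters are left out, so the totals only approximate the reported numbers.

```python
def count_conv_weights(stacks, widths=(64, 128, 256, 512),
                       in_channels=1, first_kernel=7):
    """Count convolutional weights of a basic-block ResNet (illustrative)."""
    # First convolution: first_kernel x first_kernel, in_channels -> widths[0].
    total = first_kernel * first_kernel * in_channels * widths[0]
    prev = widths[0]
    for n_blocks, width in zip(stacks, widths):
        for block in range(n_blocks):
            c_in = prev if block == 0 else width
            total += 3 * 3 * c_in * width    # first 3x3 convolution
            total += 3 * 3 * width * width   # second 3x3 convolution
            if c_in != width:
                total += c_in * width        # 1x1 projection shortcut
        if n_blocks > 0:
            prev = width
    return total


def count_layers(stacks):
    # Two convolutions per block, plus the first convolution and the output layer.
    return 2 * sum(stacks) + 2


for name, stacks in [("ResNet34-112", [3, 4, 6, 3]),
                     ("ResNet18-112", [2, 2, 2, 2]),
                     ("ResNet18-64", [2, 3, 3, 0])]:
    print(name, count_layers(stacks), count_conv_weights(stacks))
```

Under these assumptions the counts come out at roughly 21.26, 11.16 and 4.25 million weights. Most of the saving in the three-stack layout comes from dropping the 512-filter stack.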
The modified ResNets (see Fig. 1) use pre-activated residual blocks with projection shortcuts when the dimensionality increases and identity shortcuts when the dimension stays the same. In contrast to the original ResNet [11], the hyperbolic tangent function is used as the activation function instead of the ReLU function. The hyperbolic tangent function produces values in the range [-1,1], and the labels and image pixel values are normalized to this range as well. Furthermore, the projection shortcuts for increasing the dimensions are placed after instead of before the first set of batch normalization and hyperbolic tangent activation layers, as the batch normalization provides a regularization effect on the image pixel values.
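The order of operations in such a block can be sketched in plain Python. This is a toy illustration rather than the authors' TensorFlow code: `normalize` is a simplified stand-in for batch normalization that works on a single feature vector, and `conv1`, `conv2` and `projection` are hypothetical callables that stand in for convolutional layers.

```python
import math


def normalize(values, eps=1e-5):
    # Simplified stand-in for batch normalization: zero mean, unit variance
    # over one feature vector (real batch norm uses batch statistics).
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def tanh_activation(values):
    return [math.tanh(v) for v in values]


def pre_activated_block(x, conv1, conv2, projection=None):
    # Normalization and activation come before each convolution.
    activated = tanh_activation(normalize(x))
    # The projection shortcut branches off after the first norm + tanh;
    # the identity shortcut passes the raw input through.
    shortcut = projection(activated) if projection is not None else x
    out = conv1(activated)
    out = conv2(tanh_activation(normalize(out)))
    return [o + s for o, s in zip(out, shortcut)]


def halve(values):
    # Hypothetical stand-in for a convolutional layer acting on a plain list.
    return [0.5 * v for v in values]


x = [0.2, -0.4, 0.9]
print(pre_activated_block(x, halve, halve))
```

If both stand-in layers return zeros, the block with an identity shortcut returns its input unchanged, which is the behaviour the shortcut is meant to make easy to learn.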
4 Datasets
To train the models for real-life settings, two in-the-wild datasets are used for training and testing.
4.1 Annotated Facial Landmarks in the Wild (AFLW)
The AFLW dataset [14] provides a large variety of different faces with regard to ethnicity, pose, expression, age, gender and occlusion. The faces are in front of natural background under varying lighting conditions. The dataset contains 25,993 annotated faces in 21,997 images. The license agreement does not allow publication of the AFLW database (https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/, accessed 26.03.2019). All images are annotated with face coordinates and the three angles yaw, pitch and roll. 56% of the faces are tagged as female and 44% are tagged as male. Koestinger et al. [14] state that the rate of non-frontal faces of 66% is higher than in any other dataset. The distribution of poses of the AFLW dataset is not uniform, showing fewer images with a strong head rotation. The yaw angle has a range from to , the pitch angle from to and the roll angle from to . The head poses were computed with the POSIT algorithm using manually annotated facial key-points [3]. However, it is worth noting that the resulting head poses were not manually verified. The AFLW dataset has extremely wide ranges for all angles, which exceed realistic head movements by far [7].
4.2 Annotated Faces in the Wild (AFW)
The AFW dataset [34] shows a wide variety of ethnicity, pose, expression, age, gender and occlusion. The license agreement does not allow publication of the AFW database. The faces are positioned in front of natural cluttered backgrounds. There are 468 faces in 205 images. An annotation of the angles yaw, pitch and roll as well as face coordinates are provided. The yaw angle has a range from to , the pitch angle from to and the roll angle from to , all annotated manually in steps of . Since the yaw angle has the widest range, this angle is normally used when testing with this dataset.
5 Experiments
In this section, we describe the pre-processing, training parameters, evaluation methods and results of the trained models including a comparison to other state-of-the-art approaches regarding the performance and number of parameters.
5.1 Pre-processing
As explained in Section 4.1, some samples in the AFLW dataset are annotated with unrealistic values. Furthermore, few images are provided for extreme angles. Following the approach of Patacchiola and Cangelosi [25], we filter the dataset and only keep images in the following label ranges: for the yaw angle, for the pitch angle and for the roll angle. The yaw angle of the AFW dataset is also restricted to , as this angle is used for testing the trained networks. Both datasets are converted to greyscale. We crop the images using the annotated face coordinates. Each image is scaled down to 112 × 112 pixels and 64 × 64 pixels respectively. Face images smaller than the required size are left out. For the AFW dataset, the use of face images greater than 150 pixels is an additional constraint, following the protocol in [34]. To normalize the values, the labels are rescaled from to and the pixel values are rescaled from to .
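The filtering and normalization steps can be sketched as follows. The sketch assumes a linear mapping to the range [-1, 1] and 8-bit pixel values in [0, 255]; the angle limit `EXAMPLE_LIMIT` is a placeholder chosen for illustration and is not the value used in the experiments.

```python
def rescale(value, low, high):
    """Map value linearly from [low, high] to [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def restore(scaled, low, high):
    """Inverse of rescale: map from [-1, 1] back to [low, high]."""
    return (scaled + 1.0) / 2.0 * (high - low) + low


def within_range(angle, limit):
    # Keep a sample only if its label lies in [-limit, limit].
    return -limit <= angle <= limit


# Placeholder limit for illustration only; not the value used in the paper.
EXAMPLE_LIMIT = 90.0
angles = [-120.0, -30.0, 0.0, 45.0, 95.0]
kept = [a for a in angles if within_range(a, EXAMPLE_LIMIT)]
labels = [rescale(a, -EXAMPLE_LIMIT, EXAMPLE_LIMIT) for a in kept]
pixels = [rescale(p, 0, 255) for p in (0, 128, 255)]  # assumes 8-bit input
print(kept, labels, pixels)
```

A network output in [-1, 1] can be converted back to degrees with `restore` before an error in degrees is computed.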
In total, four datasets are prepared for our training: AFLW-112, AFLW-64, AFW-112 and AFW-64. The total amount of face images is 16,931 in AFLW-112, 20,872 in AFLW-64, 325 in AFW-112 and 352 in AFW-64.
5.2 Methods
The proposed ResNet architectures were implemented using TensorFlow and trained on an NVIDIA Tesla P100 GPU. The pre-processing, training and evaluation are implemented in one pipeline. The convolutional weights are initialized with the variance scaling initializer and trained with an initial learning rate of 0.1, which is decreased by a factor of 10 after 30, 60, 80 and 90 epochs. L2 regularization is applied with a weight decay of 0.0002, excluding the batch normalization layers. We use a batch size of 256. All modified ResNets are trained separately for each angle.
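The learning rate schedule described above is piecewise constant and can be written as a small function. This is a sketch that assumes a zero-based epoch index, so that the first reduced rate applies from epoch index 30 onward; the function name and parameters are chosen here for illustration.

```python
def learning_rate(epoch, initial=0.1, boundaries=(30, 60, 80, 90), factor=10.0):
    """Piecewise-constant learning rate for a zero-based epoch index."""
    # Count how many boundaries have been passed; each divides the rate by `factor`.
    drops = sum(1 for boundary in boundaries if epoch >= boundary)
    return initial / factor ** drops


for epoch in (0, 29, 30, 60, 80, 90, 119):
    print(epoch, learning_rate(epoch))
```

After the last boundary the rate stays at one ten-thousandth of its initial value for the remaining epochs.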
There are two training and testing procedures: (1) five-fold cross-validation with the AFLW-112 and AFLW-64 datasets and (2) five training-testing cycles with training on the whole AFLW-112 or AFLW-64 dataset and testing on the AFW-112 or AFW-64 dataset (see Fig. 2). The ResNet18-112 and ResNet34-112 are trained for 200 epochs in both cases, and the ResNet18-64 is trained for 120 epochs in case (1) and for 150 epochs in case (2). The number of epochs was determined empirically.
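Procedure (1) can be sketched with a small index splitter. How the folds were actually drawn is not stated, so the shuffling, the fixed seed and the round-robin assignment below are assumptions made for illustration.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(n_samples=23, k=5):
    print(len(train), len(test))
```

Each sample is used for testing exactly once, and the reported result is the average over the five folds.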
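The results are measured in mean absolute error (MAE) (see Eq. 2), where $\hat{y}_i$ describes the predicted values and $y_i$ the true values in degrees. The number of testing examples is $n$.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \qquad (2)$$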
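As a worked example of the mean absolute error, the following sketch averages the absolute differences between predicted and true angles in degrees. The function name and the sample values are made up for illustration.

```python
def mean_absolute_error(predicted, true):
    """MAE between two equally long sequences of angles in degrees."""
    if len(predicted) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


predicted = [10.0, -5.0, 31.5]
true = [12.0, -9.0, 30.0]
print(mean_absolute_error(predicted, true))  # (2 + 4 + 1.5) / 3 = 2.5
```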
As in other approaches, the predicted and true values are mapped onto discrete categories with a size of 15° (i.e. …,]-7.5,7.5],]7.5,22.5],…) to compute the accuracy. If the predicted value is in the same category as the true value, the prediction is classified as correct, otherwise as incorrect. A second evaluation method also counts a prediction as correct if its category adjoins the true category. This gives a range of 45°, within which the predicted value can be classified as correct.
The mapping of the true and the predicted values onto categories is problematic, because values located near the category borders can distort the result. Furthermore, it is questionable whether the category-based evaluation is meaningful, as a wide range of degrees is considered a correct prediction. The MAE, on the other hand, provides a clear interpretation of the results.
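The category-based accuracy and its border effect can be reproduced with a few lines. The sketch assumes half-open categories of 15° of the form ]lower, upper], matching the intervals listed above; the function names and sample angles are illustrative.

```python
import math

CATEGORY_SIZE = 15.0  # degrees; categories are ]-7.5, 7.5], ]7.5, 22.5], ...


def category(angle):
    """Index of the half-open category ]lower, upper] containing angle."""
    return math.ceil((angle - CATEGORY_SIZE / 2) / CATEGORY_SIZE)


def accuracy(predicted, true, tolerance=0):
    """Share of predictions within `tolerance` categories of the true one."""
    hits = sum(1 for p, t in zip(predicted, true)
               if abs(category(p) - category(t)) <= tolerance)
    return hits / len(true)


predicted = [0.0, 10.0, 30.0, 7.6]
true = [5.0, 5.0, 5.0, 7.4]
print(accuracy(predicted, true))               # same category only
print(accuracy(predicted, true, tolerance=1))  # adjoining categories count
```

The last pair shows the border problem: the prediction is only 0.2° off, yet it falls into a different category than the true value and is counted as incorrect under the stricter measure.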
5.3 Results
Table 2 shows the average training results for the three modified ResNets, evaluated on the pre-processed AFLW and AFW datasets with the methods explained in Section 5.2. As presumed in the introduction, the comparison between the evaluated ResNets shows that their results for the three angles are quite similar. The results are similar when tested on the AFLW dataset; when tested on the AFW dataset, only the ResNet34-112 shows worse results than the ResNet18-64 and ResNet18-112.
Since the distribution of correctly classified images across label ranges is important for applications using head pose estimation, heatmaps are also considered as an evaluation tool. The heatmaps (see Fig. 3) show that the ResNet18-64 displays a more uniform distribution over the categories than the ResNet18-112. In comparison to the ResNet18-112, the ResNet18-64 shows a higher percentage of correctly classified images in categories closer to and a similar percentage of correctly classified images in categories closer to .
The number of parameters decreases with the reduction of the model’s complexity (see Table 2). As expected, the ResNet18-64 with 18 layers and an input image size of 64 × 64 pixels has the lowest number of parameters. The inference time was measured once on an Intel Core i7-6700 CPU running at 3.40GHz and once on this CPU equipped with an NVIDIA GeForce GTX 1060 6GB GPU. The frames per second (fps) rates of the different ResNets show a significant speed-up with the reduction of the model complexity and the resulting decrease in parameters. The ResNet18-64 achieves 50 fps on the CPU, which is suitable for most real-time applications.
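A frame rate of this kind can be measured with a simple timing loop. The sketch below is not the measurement code used for the reported numbers: `dummy_model` is a hypothetical stand-in for a trained network, and the warm-up count is an arbitrary choice.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Return frames per second of `infer` over the given frames."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up calls are not timed
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def latency_ms(fps):
    # Per-frame time budget implied by a frame rate.
    return 1000.0 / fps


def dummy_model(frame):
    # Hypothetical stand-in for a trained network.
    return sum(frame) / len(frame)


frames = [[0.0] * 64 * 64 for _ in range(20)]
print(measure_fps(dummy_model, frames))
print(latency_ms(50))  # 50 fps corresponds to 20 ms per frame
```

Read this way, 50 fps on the CPU leaves a budget of 20 ms per frame, while 8 fps corresponds to 125 ms per frame.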
Other approaches evaluated on the AFLW and AFW datasets are summarized in Table 3. The number of parameters of [27] is based on their provided open source implementation (https://github.com/natanielruiz/deep-head-pose, accessed 09.01.2019), which is executable on a GPU based system. In order to compare the frame rate, we reimplemented the LeNet-5 variant of [25]. In comparison, our ResNet18-64 has the lowest number of parameters while predicting more accurately than the LeNet-5 variant [25] and nearly as accurately as the ResNet50 [27]. Patacchiola and Cangelosi [25] also use low-resolution images with x pixels, while Ruiz et al. [27] take larger images with x pixels. We believe that low-resolution images are better suited for real-world applications because they improve the computational efficiency. Compared to our reimplementation of the LeNet-5 variant, our ResNet18-64 achieves a significantly higher frame rate on our CPU setup. Overall, our parameter-reduced ResNet18-64 achieves state-of-the-art precision and at the same time real-world applicability, even on CPUs.
6 Conclusion and Future Work
In this paper, we explored parameter-reduced Residual Networks (ResNets) of varying complexity for head pose estimation in order to achieve real-time performance. Based on the presumption that not all residual blocks contribute equally to the learning process, we showed that it is possible to reduce the number of parameters of the ResNet architecture while maintaining the performance. We proposed two new ResNet architectures for inputs of 112 × 112 pixels, one with 18 layers and one with 34 layers. To reduce the number of parameters even further, we proposed the ResNet18-64 with 18 layers for low-resolution inputs of 64 × 64 pixels. The ResNet18-64 achieves real-time capability even on a CPU-based system with a performance close to state-of-the-art results. To ensure real-world applicability, we evaluated the modified ResNets on the two in-the-wild datasets AFLW and AFW. In the future, this approach could be extended to a model that estimates all three angles at once.
References
- [1] Benenson, R., Omran, M., Hosang, J., Schiele, B.: Ten years of pedestrian detection, what have we learned? In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8926, pp. 613–627. Springer, Cham (2015)
- [2] Beymer, D.: Face recognition under varying pose. In: CVPR. vol. 94, p. 137. Citeseer (1994)
- [3] Dementhon, D.F., Davis, L.S.: Model-based object pose in 25 lines of code. Int. J. Comput. Vision 15 (1–2), 123–141 (1995)
- [4] Diebel, J.: Representing attitude: Euler angles, unit quaternions, and rotation vectors. Matrix 58 (15-16), 1–35 (2006)
- [5] Fanelli, G., Dantone, M., Gall, J., Fossati, A., Van Gool, L.: Random forests for real time 3D face analysis. International Journal of Computer Vision 101 (3), 437–458 (2013)
- [6] Fanelli, G., Gall, J., Van Gool, L.: Real time head pose estimation with random regression forests. In: CVPR 2011. pp. 617–624. IEEE (2011)
- [7] Ferrario, V.F., Sforza, C., Serrao, G., Grassi, G., Mossi, E.: Active range of motion of the head and cervical spine: a three-dimensional investigation in healthy young adults. Journal of Orthopaedic Research 20 (1), 122–129 (2002)
- [8] Friesen, E., Ekman, P.: Facial action coding system: a technique for the measurement of facial movement. Palo Alto (1978)
