TL;DR
This paper introduces Ize-Net, an unsupervised deep learning model trained on a large web-sourced dataset to estimate eye gaze regions, demonstrating effective transfer to standard datasets and enhancing gaze estimation techniques.
Contribution
The paper presents a novel self-supervised learning approach for eye gaze estimation using a large in-the-wild dataset, enabling effective transfer learning and improved gaze region classification.
Findings
Ize-Net achieves competitive accuracy after fine-tuning on benchmark datasets.
The learned features improve traditional machine learning methods for gaze estimation.
Unsupervised training on web data reduces the need for labeled datasets.
| Dataset | Center | Left | Right | Total |
|---|---|---|---|---|
| Train set | 32,450 | 38,230 | 37,338 | 1,08,018 |
| Validation set | 14,008 | 16,584 | 15,641 | 46,233 |
| Total | 46,458 | 54,814 | 52,979 | 1,54,251 |
| Method/Network | CAVE | Proposed dataset (validation) |
|---|---|---|
| Eye Gaze heuristic | 60.37% | N/A |
| Alexnet | N/A | 88.22% |
| VGG-Face | N/A | 84.30% |
| Ize-Net | 82.80% | 91.50% |
| Method | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| k-NN | 9.26 | 6.45 | 6.29 | 3.73 | 3.69 | 2.36 | 3.31 | 3.26 | 2.42 | 2.48 |
| RF | 7.2 | 4.76 | 4.99 | 3.29 | 3.17 | | | | | |
| GPR | 7.38 | 6.04 | 5.83 | 4.07 | 4.11 | | | | | |
| SVR | - | - | - | - | 4.07 | | | | | |
Unsupervised Learning of Eye Gaze Representation from the Web
Neeru Dubey, Shreya Ghosh, Abhinav Dhall
Learning Affect and Semantic Image analysIs (LASII) Group
Department of Computer Science and Engineering
Indian Institute of Technology Ropar
Ropar, India
{neerudubey, shreya.ghosh, abhinav}@iitrpr.ac.in
Abstract
Automatic eye gaze estimation has long interested researchers. In this paper, we propose an unsupervised learning based method for estimating the eye gaze region. To train the proposed network “Ize-Net” in a self-supervised manner, we collect a large ‘in the wild’ dataset containing 1,54,251 images from the web. For the images in the database, we divide the gaze into three regions using an automatic technique based on pupil-center localization and then use a feature-based technique to determine the gaze region. The performance is evaluated on the Tablet Gaze and CAVE datasets by fine-tuning Ize-Net for the task of eye gaze estimation. The learned feature representation is also used to train traditional machine learning algorithms for eye gaze estimation. The results demonstrate that the proposed method learns a rich data representation, which can be efficiently fine-tuned for any eye gaze estimation dataset.
I Introduction
Eye gaze estimation aims to determine the line of sight of the pupil. It provides information about human visual attention and cognitive processes [1]. It aids several applications such as human-computer interaction [2], student engagement detection [3], video games with basic human interaction [4], driver attention modelling [5], and psychology research [6].
Gaze estimation techniques can be broadly classified into two types: intrusive and non-intrusive. Intrusive techniques require contact with human skin or eyes. They include the use of head-mounted devices, electrodes and scleral coils [7, 8, 9]. These devices provide accurate gaze estimation but cause an unpleasant user experience. Non-intrusive techniques do not require physical contact [10]. Image processing based gaze estimation methods come under the non-intrusive category. These methods face a number of challenges, which include partial occlusion of the iris by the eyelid, illumination conditions, head pose, and specular reflection if the user wears glasses; the inability to use standard shape fitting for iris boundary detection; and effects like motion blur and over-saturation of the image [10]. To deal with these challenges, most of the accurate gaze estimation methods have been evaluated under constrained environments, with fixed head pose, illumination conditions, camera angle, etc. Such methods require a large volume of high-resolution labelled images. Robust gaze estimation needs accurate pupil-center localization. Fast and accurate pupil-center localization is still a challenging task [11], particularly for low-resolution images.
With the success of supervised deep learning techniques, especially convolutional neural networks, much progress has been witnessed in most computer vision problems. This is primarily due to the availability of graphics processing unit (GPU) hardware and large labelled databases. However, it has been noted that labelling for complex vision tasks is a noisy and error-prone process. Hence, there is an interest in exploring deep learning based unsupervised techniques for computer vision tasks [12, 13, 14, 15].
In this paper, we propose an unsupervised (self-supervised) technique for learning a discriminative eye-gaze representation. The method is based on exploiting the domain knowledge generated by analyzing YouTube videos. The aim is to learn a feature representation for eye gaze, which can be easily used by itself or fine-tuned for complex eye gaze related tasks. The experimental results show the effectiveness of our technique in predicting the eye gaze as compared to supervised techniques.
The main contributions of this paper are as follows:
1. A dataset (Figure 1) of 1,54,251 facial images of 100 different subjects is collected from YouTube videos. These images are automatically labeled using the proposed method of eye gaze region estimation.
2. A deep neural network, “Ize-Net”, is proposed and trained on the proposed dataset. The results show that unsupervised techniques can be used for learning a rich representation for eye gaze.
3. A method to detect whether the subject in the input image is looking towards his/her left, right, or center region. The gaze region is estimated from the relative position of both (left and right) pupils in the eye sockets.
4. A method to localize the pupil-center using facial landmarks, OTSU thresholding [16] and Circular Hough Transformation (CHT) [17].
The remainder of this paper is organized as follows: Section II describes some of the related studies. Section III presents the details of the proposed pupil-center localization and gaze estimation methods. In Section IV, we empirically study the performance of the proposed approach. Section V contains the conclusion and future work.
II Related Work
The proposed method comprises pupil-center localization and eye gaze estimation techniques. Accordingly, this section reviews some of the relevant pupil-center localization and eye gaze estimation methods.
The most popular solutions presented for the task of pupil-center localization can be broadly classified into active and passive methods [10].
The active pupil-center localization methods utilize dedicated devices to precisely locate the pupil-center, such as infrared cameras [7], contact lenses [8] and head-mounted devices [9]. These devices require a pre-calibration phase to perform accurately. They are generally very expensive and cause an uncomfortable user experience.
The passive eye localization methods try to gather information about the pupil-center from the supplied image or video frame. Valenti et al. [18] used isophotes to infer circular patterns and used machine learning for the prediction task. An open eye can be distinctively characterized by its shape and by components such as the iris and pupil contours. The structure of an open eye can be used to localize it in an image. Such methods can be broadly divided into voting-based methods [19, 20] and model fitting methods [21, 22]. Although these methods seem very intuitive, they do not provide good accuracy. Several machine learning based pupil-center localization methods have also been proposed. One such method was proposed by Campadelli et al. [23], in which they used two Support Vector Machines (SVM) and trained them on properly selected Haar wavelet coefficients. In [24], randomized regression trees were used. These supervised learning based methods require the tiresome process of data labeling.
In this paper, we propose a method which overcomes the aforementioned limitations. It is a geometric feature-based pupil-center localization method, which gives accurate results for images captured in uncontrolled environments. In the past, various visible imaging based eye gaze tracking methods have been proposed, which can be broadly classified into feature-based methods and appearance-based methods.
Feature-based methods utilize some of the prior knowledge to detect the subject’s pupil-centers from simple pertinent features based on shape, geometry, color and symmetry. These features are then used to extract eye movement information. Morimoto et al. [25] assumed a flat cornea surface and proposed a polynomial regression method for gaze estimation. In [26], Zhu and Yang extracted intensity feature from an image and used a Sobel edge detector to find pupil-center. The gaze direction was determined via linear mapping function. The detected gaze direction was sensitive to the head pose, therefore, the users must stabilize their heads. In [27], Torricelli et al. performed the iris and corner detection to extract the geometric features, which were mapped to the screen coordinates by the general regression neural network. In [18], Valenti et al. estimated the eye gaze by combining the information of eye location and head pose.
Appearance-based gaze tracking methods do not explicitly extract features; instead, they utilize the whole image for eye gaze estimation. These methods normally do not require geometry information or camera calibration, since the gaze mapping is performed directly on the image content. These methods usually require a large number of images to train the estimator. To reduce the training cost, Lu et al. [28] proposed a decomposition scheme. It included the initial gaze estimation and the subsequent compensations for the gaze estimation to perform effectively using training samples. Huang et al. [29] proposed an appearance-based gaze estimation method in which the video captured from the tablet was processed using HoG features and Linear Discriminant Analysis (LDA). In [30], an eye gaze tracking system was proposed, which extracted the texture features from the eye regions using the local pattern model. Then it fed the spatial coordinates into the Support Vector Regressor to obtain a gaze mapping function. Zhang et al. [31] proposed GazeNet, a deep gaze estimation method. Williams et al. [32] proposed a sparse and semi-supervised Gaussian process model to infer the gaze, which simplified the process of collecting training data.
In this paper, we propose an unsupervised method to detect whether a subject is looking towards his/her left, right or center region. We utilize the relative position of pupils in the eye sockets for judging the gaze region. This method allows us to predict the gaze region for a variety of images containing different textures, races, genders, specular attributes, illuminations and camera qualities.
III Proposed Method
This section explains the pipeline of the proposed gaze region estimation method. First, we localize the pupil-centers. Then, we use them to estimate the region of eye gaze with an intuitive approach which works well for images captured in the wild. Further, for learning the eye gaze representation, we collect a large dataset of human faces. The domain knowledge based pupil-centers and facial points are used to create noisy labels representing gaze regions (left, right, or center). A network is then trained for this task. Later, we show the usefulness of the representation learned from the fully automatically generated noisy labels.
III-A Dataset Collection
In recent years, several gaze estimation datasets have been published. Most of these datasets contain little variety in terms of head poses, illumination, number of images, collection duration per subject and camera quality. To demonstrate that our proposed method is versatile, we collect a dataset containing 1,54,251 facial images belonging to 100 different subjects. The overall statistics of our dataset are shown in Table I. We download different types of videos from YouTube’s Creative Commons section. These videos belong to categories in which a single subject is seen on the screen at a time, such as news reporting, makeup tutorials, speech videos, etc. We consider every third frame of the collected videos for dataset creation. For training purposes, the dataset is split into training and validation sets with 70% and 30% uniform partitions over the subjects. An overview of our proposed dataset is shown in Figure 1. In this figure, we can observe that our dataset contains a wide variety of images with varying illumination, occlusion, blurriness, color intensity, etc. Table II provides a comparison of the state-of-the-art gaze datasets with our proposed dataset.
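The frame sampling and the 70%/30% partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `sample_frames` and `split_subjects` are hypothetical, and the sketch assumes that the split is made at the subject level so that no subject appears in both partitions.

```python
import random

def sample_frames(num_frames, step=3):
    """Return the indices of every `step`-th frame (0, 3, 6, ...)."""
    return list(range(0, num_frames, step))

def split_subjects(subject_ids, train_fraction=0.7, seed=0):
    """Shuffle subject ids and split them so the two sets are disjoint.

    Assumption: the split is done over subjects, not over individual images.
    """
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# 100 subjects, as in the collected dataset.
train_ids, val_ids = split_subjects(range(100))
```

Splitting by subject rather than by image keeps near-duplicate frames of the same person out of the validation set, which would otherwise inflate validation accuracy.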
III-B The Pupil-Center Localization
Accurate pupil-center localization plays an important role in eye gaze estimation. We take a face image as input and extract the eyes from it using the facial landmarks obtained by the Dlib-ml library [33]. Further processing is performed on the extracted eye images. We localize the pupil-center using two methods, i.e. blob center detection and CHT, and take the average of the pupil-centers obtained by both methods as the final pupil-center.
The steps of the proposed pupil-center localization method are explained below:
1. Extract eyes using facial landmark information.
2. Apply OTSU thresholding on the extracted eyes to take advantage of the unique contrast property of the eye region during pupil circle detection.
3. Apply blob center detection on the extracted iris contours to calculate the ’primary’ pupil-centers.
4. Crop regions near these centers to perform the center rectification task. The crop length is decided by applying equation (1).
   [TABLE]
5. Compute adaptive thresholding and apply the Canny edge detector [34] to make the iris region more prominent.
6. Apply CHT over the edge image to find the secondary pupil-centers.
7. Compute the average of the primary and secondary pupil-centers to obtain the final pupil-centers.
Empirically, we noticed that the pupil-center localization accuracy increases when we take the average of the pupil-centers calculated by the above two methods. A few sample results of pupil-center localization are shown in Figure 2. The blue, green and pink dots represent the pupil-center obtained by our primary method, secondary method and their average, respectively.
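The primary (threshold plus blob center) estimate and the final averaging step can be sketched in plain Python. This is a simplified illustration under stated assumptions: it treats all pixels at or below the Otsu threshold as a single dark blob instead of extracting iris contours, it does not implement the cropping, Canny or CHT steps, and the secondary center is simply passed in as a value. All function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t maximizing the between-class variance
    of the two classes {v <= t} and {v > t} (pixels are ints in 0..255)."""
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def dark_blob_center(image):
    """Centroid (x, y) of all pixels at or below the Otsu threshold.

    Assumes the image is not uniform, so at least one dark pixel exists.
    """
    t = otsu_threshold([v for row in image for v in row])
    pts = [(x, y) for y, row in enumerate(image)
           for x, v in enumerate(row) if v <= t]
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def average_centers(primary, secondary):
    """Final pupil-center: mean of the primary and secondary estimates."""
    return ((primary[0] + secondary[0]) / 2, (primary[1] + secondary[1]) / 2)

# Toy 5x7 "eye" crop: bright background (200) with a dark 3x3 pupil (20).
eye = [[200] * 7 for _ in range(5)]
for y in range(1, 4):
    for x in range(2, 5):
        eye[y][x] = 20

primary = dark_blob_center(eye)
final = average_centers(primary, (4.0, 2.0))  # secondary center assumed given
```

In practice the thresholding, edge detection and circle detection would come from an image processing library; the pure-Python version here only makes the arithmetic of the two estimates explicit.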
III-C Heuristic for Eye Gaze Region Estimation
The pupil-center is the most decisive feature of the face for determining gaze direction. The eyeballs move in the eye sockets to change the direction of the gaze. By using the relative position of both pupil-centers, we can determine the region in which the subject is looking. When a subject looks towards his/her left, the irises of both eyes shift towards the left. To utilize this characteristic, we compare the angle formed between the line joining the left pupil-center to the nose and the vertical through the nose with the angle formed between the line joining the right pupil-center to the nose and the same vertical. These angles are demonstrated in Figure 3 as angles and . For a subject to look towards his/her left region, the left eye angle has to be bigger than the right eye angle . This intuitive heuristic is used to detect the region (left, right, or center) in which the subject is looking. The proposed method is robust to head movements within the range of [math] to [math].
The eye corners remain fixed during eye movement. We utilize the eye corner points, given by the Dlib-ml library, to determine the head pose direction in the same way as we determine the eye gaze region. The angles used to determine the head pose direction are demonstrated in Figure 3 as angles and .
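The angle comparison behind the heuristic can be sketched as below. This is an illustrative reading of the description, not the authors' implementation: the function names are hypothetical, "left" and "right" are assumed to refer to the subject's own eyes, and the `tolerance_deg` margin used to decide the center region is an assumed parameter that the text does not specify.

```python
import math

def angle_to_vertical(point, nose):
    """Angle in degrees between the nose-to-point line and the vertical
    line through the nose (image coordinates, y grows downwards)."""
    dx = abs(point[0] - nose[0])
    dy = abs(nose[1] - point[1])
    return math.degrees(math.atan2(dx, dy))

def gaze_region(left_pupil, right_pupil, nose, tolerance_deg=2.0):
    """Classify gaze as 'left', 'right' or 'center' by comparing the two angles.

    tolerance_deg is an assumed margin, not a value taken from the paper.
    """
    diff = angle_to_vertical(left_pupil, nose) - angle_to_vertical(right_pupil, nose)
    if diff > tolerance_deg:
        return "left"
    if diff < -tolerance_deg:
        return "right"
    return "center"

nose = (50, 60)
# The subject's left eye appears on the right side of the image.
looking_center = gaze_region((70, 30), (30, 30), nose)
looking_left = gaze_region((75, 30), (35, 30), nose)
looking_right = gaze_region((65, 30), (25, 30), nose)
```

When the subject looks to the left, the left pupil moves away from the nose and the right pupil moves towards it, so the left angle grows while the right angle shrinks, which is the condition stated above.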
III-D Proposed Network Architecture
Deep Neural Networks (DNNs) are well known to perform exceptionally well at visual recognition tasks. A DNN detects a face from a collection of face parts without regard to their arrangement. Therefore, it is not suitable when the relative position of face parts is used for classification. To overcome this problem, the Capsule Network [41] was proposed. The Capsule Network constrains the relative position of face parts. The proposed method takes face symmetry into consideration while detecting the eye gaze region. To combine the advantages of both networks, we propose the “Ize-Net” network. The architecture of the proposed network is shown in Figure 4. This network is trained using images of size . We take the entire face as input instead of only the eyes. According to [42], gaze can be more accurately predicted when the entire face is considered. Our proposed network contains five Convolution layers. Each Convolution is followed by Batch Normalization and Max-Pooling. For Batch Normalization, we use ’ReLU’ as the activation function. For Max-Pooling, a kernel of size () is used. A stride of () is used for each layer. After the Convolution layers, we append a Primary Capsule, whose job is to take the features learned by the Convolution layers and produce combinations of these features that take face symmetry into account. The output of the Primary Capsule is flattened and fed to Fully-Connected (FC) layers of dimension 1024 and 512. In the end, we apply ‘Softmax’ activation to produce the final output.
III-E Dataset Specific Fine Tuning
The Ize-Net network is trained on the proposed dataset for the task of gaze region estimation. The learned data representation can be fine-tuned on any specific dataset for determining the exact gaze location. In the experiments section, we report results for various levels of fine-tuning on the Tablet Gaze and CAVE datasets. The high accuracy of the experimental results demonstrates that the proposed method learns a rich data representation. The learned representation can be used directly for gaze region estimation, or it can be fine-tuned for the exact gaze estimation task on a specific dataset.
IV Experiments
For the experiments, we use the Keras deep learning library with the TensorFlow backend. The code, dataset and model are available online (https://github.com/Neerudubey/Unsupervised-Eye-gaze-estimation).
IV-A Validation of Pupil Localization
The pupil-center detection is performed using OTSU thresholding with blob center detection and CHT. To perform CHT, we crop the image around the pupil-center detected using OTSU thresholding and blob center detection. We use an offset of 5 pixels to crop the image. We validate the proposed pupil-center localization method (Section III-B) on the BioID dataset [45]. BioID is a publicly available dataset which contains 1,521 frontal face images of 23 subjects. The evaluation protocol, given in equation 2, is the same as the one used in [45].
$$e = \frac{\max(d_l, d_r)}{\lVert C_l - C_r \rVert} \qquad (2)$$
where $e$ is the error term, $d_l$ and $d_r$ are the Euclidean distances between the localized pupil-centers and the ground truth ones, and $C_l$ and $C_r$ are the left and right pupil-centers, respectively, in the ground truth.
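The normalized error commonly used with BioID divides the larger of the two pupil-center localization distances by the ground-truth distance between the two pupil-centers. The sketch below assumes that form; the function name `normalized_error` and the example coordinates are hypothetical.

```python
import math

def normalized_error(pred_left, pred_right, true_left, true_right):
    """Worst-eye localization error normalized by the ground-truth
    inter-pupil distance (assumed form of the evaluation measure)."""
    d_left = math.dist(pred_left, true_left)
    d_right = math.dist(pred_right, true_right)
    return max(d_left, d_right) / math.dist(true_left, true_right)

# Ground-truth pupils 60 px apart; the left prediction is off by 5 px.
error = normalized_error((103, 54), (40, 50), (100, 50), (40, 50))
```

Because the error is normalized by the inter-pupil distance, it is comparable across images in which the face appears at different scales.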
We neglect some of the images, where Dlib-ml failed to detect the face or any of the eye contours. Table VI shows the comparison of the proposed method with some of the state-of-the-art methods. This table shows that our method is absolutely accurate in and cases, but it does not perform well enough when . The reason behind this is the inaccurate circle detection by CHT which propagates the error while averaging primary and secondary pupil-centers (Section III-B).
IV-B Validation of Eye Gaze Region Estimation
The effectiveness of the proposed eye gaze region estimation heuristic is validated on the CAVE dataset [35]. For this purpose, we map the given angular labels of the CAVE dataset images into left, right and center gaze regions based on the sign (positive or negative) of the given gaze point. The validation results are shown in Table III. After the heuristic evaluation, we also evaluate the performance of the Alexnet [50] and VGG-Face [51] networks on the collected dataset. Alexnet achieves 88.22% validation accuracy and VGG-Face achieves 84.30% validation accuracy. For training both networks, we use the Stochastic Gradient Descent (SGD) optimizer with categorical cross-entropy as the loss function. The learning rate and momentum are set to 0.01 and 0.9, respectively.
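The sign-based label mapping described at the start of this subsection can be sketched as below. The function name is hypothetical, and both the treatment of a zero angle as the center region and the choice of which sign corresponds to which side are assumptions, since the text does not state them.

```python
def region_from_horizontal_angle(angle_deg, positive_is="right"):
    """Map a signed horizontal gaze angle to 'left', 'center' or 'right'.

    Assumptions: 0 degrees means center, and `positive_is` names the side
    that positive angles correspond to (the sign convention is not given).
    """
    if positive_is not in ("left", "right"):
        raise ValueError("positive_is must be 'left' or 'right'")
    if angle_deg == 0:
        return "center"
    negative_is = "left" if positive_is == "right" else "right"
    return positive_is if angle_deg > 0 else negative_is

labels = [region_from_horizontal_angle(a) for a in (-15, -5, 0, 10)]
```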
IV-C Performance of Ize-Net Network
For training the proposed Ize-Net network, we initialize the network weights with the ‘glorot normal’ distribution. We use the SGD optimizer with learning rate 0.001 with the decay of per epoch. We use categorical cross-entropy as the loss function to train the proposed network. As shown in TABLE III, it gives 91.50% accuracy on the validation data of the proposed dataset. The proposed network outperforms the AlexNet and VGG-Face networks. The primary reason behind the better performance of Ize-Net is the presence of the primary capsule. This enables the network to take the geometry of the face into account during gaze region prediction. The consideration of face geometry is in accordance with the proposed heuristic used to label the images of the collected dataset. We validate the performance of the proposed network on the CAVE dataset. The angular labels of the CAVE dataset images are mapped into three gaze regions. After categorizing the images into their corresponding gaze regions, we fine-tune Ize-Net on the entire CAVE dataset to cross-check the performance of this network. We fine-tune our network for 10 epochs with a learning rate of 0.0001 [35]. As shown in TABLE III, our network gives 82.80% five-fold cross-validation accuracy on the CAVE dataset.
IV-D Fine Tuning Results on Tablet Gaze and CAVE Datasets
To fine-tune the base model for other datasets, we add two Fully-Connected (FC) layers at the end of the proposed Ize-Net network. We fine-tune the network on the Tablet Gaze and CAVE datasets. The two FC layers added to the base network are each of dimension 256 for both datasets. The fine-tuning results for Tablet Gaze are shown in TABLE IV and those for CAVE are shown in TABLE V. As depicted in TABLE IV and V, we report the results for different levels of fine-tuning. The last 8 FC layers, the last 12 FC layers and the complete network are fine-tuned one by one for the empirical analysis of results. For fine-tuning the proposed network on the Tablet Gaze dataset, we use a learning rate of 0.0001 with 10 epochs, and for the CAVE dataset, we use a learning rate of 0.0001 with 15 epochs. For both datasets, the fine-tuning is done with the mean square error loss function. The experimental results demonstrate that the proposed method outperforms the state-of-the-art gaze prediction for both the Tablet Gaze and CAVE datasets. For the experiments, we try our best to follow the protocols discussed in [43] and [29]. However, there can be a few differences in frame extraction and selection. To demonstrate that the network learned efficient features, we train a Support Vector Regressor (SVR) on the features learned in the 31st FC layer and the 34th FC layer for the Tablet Gaze dataset. As depicted in TABLE IV, the low gaze prediction errors of the SVR confirm that the learned features are highly efficient.
V Conclusion and Future Work
In this paper, we propose a method which learns a rich eye gaze representation using an unsupervised learning technique. Using the relative position of the pupil-centers in the left and right eyes, the images are labeled based on gaze region, i.e. left, right, or center. To demonstrate the robustness of the proposed method, we collect a large dataset of facial images. We also propose the Ize-Net network, which is trained on the collected dataset. The weights of this trained model can be used on any facial image to detect the region of gaze. Machine learning methods can be applied to the learned gaze region representation to calculate eye gaze. Experimental results confirm the effectiveness of the proposed method.
The proposed gaze estimation method can be widely used in human-computer interaction applications without the need for a troublesome data labelling task beforehand.
Currently, our method is robust to head pose movement within [math] to [math]. In the future, we plan to utilize the head pose information completely while estimating the gaze region. We also plan to perform real-time pupil-center localization and gaze region estimation for a video-based dataset.
Acknowledgement
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
- [1] M. Mason, B. Hood, and C. Macrae, “Look into my eyes: Gaze direction and person memory,” Memory, 2004.
- [2] B. Ghosh, A. Dhall, and E. Singla, “Speech-gesture mapping and engagement evaluation in human robot interaction,” arXiv, 2018.
- [3] A. Kaur, A. Mustafa, L. Mehta, and A. Dhall, “Prediction and localization of student engagement in the wild,” in IEEE Digital Image Computing: Techniques and Applications, 2018.
- [4] P. Barr, J. Noble, and R. Biddle, “Video game values: Human–computer interaction and games,” Interacting with Computers, 2007.
- [5] L. Fridman, P. Langhans, J. Lee, and B. Reimer, “Driver gaze region estimation without use of eye movement,” IEEE Intelligent Systems, 2016.
- [6] E. Birmingham and A. Kingstone, “Human social attention,” Annals of the New York Academy of Sciences, 2009.
- [7] D. Xia and Z. Ruan, “IR image based eye gaze estimation,” in IEEE ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2007.
- [8] D. Robinson, “A method of measuring eye movement using a scleral search coil in a magnetic field,” IEEE Transactions on Bio-Medical Electronics, 1963.
