TL;DR
This paper introduces a novel Riemannian manifold approach to analyze face video signals for photoplethysmography imaging, improving robustness and simplicity in heart rate estimation.
Contribution
It develops a topology change based on group invariance principles, enabling parameter-free, low-complexity feature extraction for PPGI from face videos.
Findings
Achieved robust heart rate estimation on public face video datasets.
Unifies invariance properties with a low-dimensional embedding.
Operates implicitly without prior knowledge or parameter tuning.
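Table 1: Heart rate prediction accuracy per database and operator. Each cell gives correlation coefficient / RMSE against the PPG reference heart rate.

| Database | Green | SSR | POS | LGI | SPH |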
|---|---|---|---|---|---|
| UBFC | 0.16/22.1 | 0.54/4.95 | 0.68/4.42 | 0.75/5.94 | 0.73/3.21 |
| LGI Resting | 0.41/2.61 | 0.49/1.99 | 0.41/2.10 | 0.69/1.41 | 0.71/1.49 |
| LGI Rotation | 0.15/13.2 | 0.06/10.9 | 0.12/5.32 | 0.67/1.92 | 0.56/2.54 |
| LGI Gym | 0.01/33.5 | 0.03/21.2 | 0.15/12.2 | 0.42/2.65 | 0.26/3.65 |
| LGI Talk | 0.15/46.6 | 0.12/27.8 | 0.01/37.7 | 0.51/14.7 | 0.23/27.8 |
On the Vector Space in Photoplethysmography Imaging
Christian S. Pilz
CanControls GmbH Aachen, Germany
http://www.cancontrols.com
Vladimir Blazek
RWTH Aachen, Germany
http://www.medit.hia.rwth-aachen.de
Steffen Leonhardt
RWTH Aachen, Germany
http://www.medit.hia.rwth-aachen.de
Abstract
We study the vector space of visible wavelength intensities from face videos widely used as input features in Photoplethysmography Imaging (PPGI). Based upon theoretical principles of Group invariance in the Euclidean space, we derive a change of the topology in which the distance between successive measurements is defined as a geodesic on a Riemannian manifold. This lower dimensional embedding of the sensor signal unifies the invariance properties with respect to translation of the features discussed by several former approaches. The resulting operator acts implicitly on the feature space, requires no prior knowledge and needs no parameter tuning. The resulting feature’s time-varying, quasi-periodic shape naturally takes the form of the canonical state space representation of the known Diffusion process of blood volume changes. The computational complexity is low and the implementation is fairly simple. In experiments the operator achieved robust and competitive heart rate estimation performance from face videos on two public databases.
Keywords Photoplethysmography Imaging · Computer Vision · Linear Algebra · Group Theory · Manifolds
1 Introduction
Nearly 70 years ago, shortly after the end of World War II, Norbert Wiener published in 1948 his sociopolitically often controversially discussed work, Cybernetics or Control and Communication in the Animal and the Machine [1]. During this period and shortly after, mankind discovered most of the important principles found in today’s everyday technology. Although we have not yet seen a fully functional realization of Wiener’s vision of self-regulating mechanisms, we can trace ongoing and fast emerging progress in the computational interpretation of sensors, signals and systems, which points at least in the direction of a soupçon of artificial intelligence.
As part of natural social interaction, the human face, with its contingent of verbal activity, non-verbal behaviour and appearance, reflects the majority of signal sources interpretable by computer systems. In contrast to non-verbal behavioral signals, facial appearance can reveal changes in surrogate states of the peripheral and central nervous system through the analysis of skin blood perfusion. The tiny intensity changes cannot be perceived by the human eye, but with the help of opto-electronic circuitry this information has become accessible. Scientifically, mere availability is nothing special. In the context of non-obtrusive remote measurement, however, this new technology enables the acquisition of biological and cognitive human data under exceptional situations, potentially leading to new findings and application fields. Basically, nearly every task targeting the specific dependent variables formerly captured by cable mounted sensors attached to human skin can be sensed by the camera based counterpart as well. The possible range of new applications and analyses justifies calling its potential ground breaking. The topic is traditionally anchored in the medical sciences. Currently, however, the focus has shifted fundamentally towards computer vision, where physiological states have a large impact. They are primarily used in human state computing tasks, where radiation unnoticeably transports information about states of affective nature from the face without contact.
During the last years, measuring blood volume changes and heart rate from facial images has frequently gained attention at top computer vision conferences [2, 3, 4, 5, 6, 7]. Most of these contributions focus on how to cope with motion such as head pose variations and facial expressions, since any kind of motion on a specific skin region of interest destroys the raw signal in a way that no reliable information can be extracted anymore. Besides estimating vitality parameters like heart rate and respiration, the functional survey of wounds and the quantification of allergic skin reactions [8] are further applications of skin blood perfusion analysis. Recently, the prediction of emotional states, stress [9, 10, 11], fatigue [12] and sickness [13] became interesting new achievements in this area, pushing the focus of this technology further towards human-machine interaction.
In contrast to the genuine medical use case of the technology, in computer vision and human-machine interaction we cannot expect any cooperative behavior from the user without introducing a lack of convenience and a reduction of general user acceptance. Further, beyond well tempered clinical and laboratory like scenarios, the majority of applications will commonly face much stronger and more challenging environmental changes and differences. Thus, there is an emerging demand for better features and models that are significantly more robust to nuisance factors while still preserving the desired target information. To reach such a formulation, a fundamental and profound understanding of the underlying optical and mathematical properties is one of the current foci of this research discipline.
The main contribution of this work is a mathematical analysis of the PPGI feature space determined over facial skin pixels. The general aim is to study the properties and behaviour of the features with respect to the actions of the Group acting on this space when induced by natural head and face motion. As a result, a new feature operator is developed and evaluated against common operators on various data sets. These sets were collected to explicitly study the influence of typical nuisance factors. To support efficient dissemination and to speed up research progress in this field, we have encapsulated all feature operators and the reference implementation of the stochastic model of blood volume changes in a new object-oriented MATLAB toolbox. The code is publicly available at http://bit.ly/PPGI-Toolbox. The toolbox provides the necessary code to reproduce all results presented in this work.
The outline of this work is as follows. First, the methodology of heart rate estimation from face videos is reviewed, from its historical origins up to the state of the art in computer vision and bio-medical engineering. Then the theoretical aspects follow: the feature space is analyzed and the proposed methodology is described mathematically. Finally, the results of an extensive evaluation on different databases are presented and discussed.
2 Related Work
The historical origin of the term Photoplethysmography, PPG for short, dates back to the late first half of the 20th century, when the two scientists Molitor and Kniazak [14] recorded peripheral circulatory changes in animals. A year later, Hertzman [15] introduced the term Photoelectric Plethysmograph as "the amplitude of volume pulse as a measure of the blood supply of the skin". Hertzman’s instrumentation consisted mainly of a tungsten arc lamp and a photomultiplier tube. At the same time the methodology was described by a German scientist in a medical journal [16]. Today, however, it can no longer be determined who really invented this technology. In the literature Hertzman is usually credited with it.
Around 50 years later the advancement of classical PPG, the camera based PPGI (with I for Imaging) method, was introduced by the pioneering work of Blazek [17]. The basic principle behind the measurement of blood volume changes in the skin by means of PPGI (as well as PPG) is the fact that the oxygen binding ferrous protein complex hemoglobin in the blood absorbs specific frequency bands of light many times more strongly than the remaining skin tissues. Accordingly, tiny intensity changes can be observed over specific frequency bands (e.g. the density of spectral lines of the emission spectra of iron) as an oscillation caused by the quasi-periodic rhythm of the human heart. In PPG a part of the skin surface is illuminated by dedicated light sources such as LED illumination panels. In PPGI a common CCD camera is used as detector and the illumination can just as well be common ambient light, for which the intensity of the backscattered optical radiation, i.e. the reflected light, is calculated [18, 19, 20].
In general, the computational pipeline to determine vitality parameters and their derivatives from blood volume changes can be regarded as a classical signal processing chain. Typically, features are calculated from a skin region of interest (ROI), filtered and analyzed by spectral methods [19, 21, 20]. The first published visualization of pulsatile skin perfusion patterns in the time and frequency domain is given by Blazek [17]. However, motion of the skin ROI [19] and micro motion of the head due to cardiac activity [22, 23] inherently induce artifacts into the extracted signal, especially when lighting is neither uniform nor orthogonal [24]. Canceling motion artifacts during signal processing became one of the most important aspects of reliable skin blood perfusion measurements. An early idea for skin ROI motion compensation is to track every skin pixel position by optical flow methods directly in the image plane [19]. However, this does not account for any change of illumination. Poh et al. [21] proposed to extract motion components in the signal by blind source separation using Independent Component Analysis (ICA) over the different color channels. Wedekind et al. [25] compared ICA in multiple settings and Principal Component Analysis and showed limitations of either transform. Further, the independent components (ICs) cannot be obtained in a deterministic order [26]. A solution to this problem is discussed by Macwan et al. [27]. Tarassenko et al. [28] tried to cope with light flicker by using auto-regressive modeling and pole cancellation. De Haan and Jeanne [29] and De Haan and Van Leest [30] proposed to map the PPGI signals by a linear combination of RGB data to a direction that is orthogonal to motion induced artifacts. An alternative approach, which in contrast to the channel mapping algorithms does not require skin-tone or pulse-related priors, determines the spatial subspace of skin pixels and measures its temporal rotation for signal extraction [31]. Tulyakov et al. [4] proposed matrix completion to jointly estimate reliable regions and heart rate estimates, whereas Li et al. [2] applied an adaptive least squares approach to extract robust pulse frequencies. Both reported performance gains similar to De Haan and Jeanne [29]. Interestingly, they used the often criticized compressed videos of the MAHNOB-HCI database [32] in their experiments. This leaves reasonable doubts about the validity of the results, since it is well known that any kind of image compression destroys the underlying tiny perfusion signal [33]. Wang et al. [34] reported an orthogonal behavior of skin color and motion artifacts derived from optical properties, but introduced a static projection operator for feature transformation and reported their results on private data. An entirely different model was introduced by Pilz et al. [5]. Here, the quasi-periodic nature of the blood volume changes is modeled as a stochastic resonator based upon a diffusion process. A group theoretically derived feature transform for motion compensation was introduced by Pilz et al. [6]. Following the popularity of Deep Learning, Chen and McDuff [7] claim to outperform recent algorithms using a convolutional neural network (CNN) architecture for modelling the motion representation. However, they also reported some findings on the compressed videos of the MAHNOB-HCI database, and they have not yet provided their CNN implementation or at least the trained model.
3 Methodology
It is by now well understood that subject motion and fast, strong changes of illumination alter the distribution of pixel intensities, making it quite difficult to extract skin perfusion signals from video images. It is assumed that the perfusion signal exists in either case, and it is further assumed that it is combined with the intensity distribution belonging to motion forces by some unknown operator. We follow the basic principles of the Hilbert projection theorem: if $\mathcal{M}$ is a closed subspace of the Hilbert space $\mathcal{H}$ and $x \in \mathcal{H}$, then there is a unique element $\hat{x} \in \mathcal{M}$ such that
$$\lVert x - \hat{x} \rVert = \inf_{y \in \mathcal{M}} \lVert x - y \rVert,$$
and this holds if and only if $\hat{x} \in \mathcal{M}$ and $(x - \hat{x}) \in \mathcal{M}^{\perp}$, where $\mathcal{M}^{\perp}$ is the orthogonal complement to $\mathcal{M}$ in $\mathcal{H}$. The current paradigm in understanding PPGI signal components in the feature space then assumes that blood volume changes exist in a lower dimensional space which is orthogonal to any kind of motion induced signal components. Derived from optical properties by De Haan et al. and Wang et al. [29, 34] and from Group theoretic principles by Pilz et al. [6], this can be expressed in terms of two orthogonal vectors whose inner product vanishes and which are thus statistically linearly independent. This principle is illustrated in Figure 2. In the following we explain the Group theoretic principles behind motion robust sensing of blood volume changes as introduced by Pilz et al. [6]. Based upon the analysis of the properties of the resulting linear operator, we demonstrate that there exists an equivalent implicit operator. This operator maps the observation onto an embedded Riemannian submanifold of the Euclidean space. We further show that the corresponding directional statistics evolve as a function of time in the form of the previously published Diffusion process model [5].
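The projection theorem quoted above can be checked numerically in a few lines. The following is a minimal sketch, not the toolbox implementation: the observation `x` and the direction `m` spanning the subspace are hypothetical values chosen for illustration only.

```python
def dot(a, b):
    """Euclidean inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_line(x, m):
    """Orthogonal projection of x onto the 1-D subspace spanned by m."""
    scale = dot(x, m) / dot(m, m)
    return [scale * mi for mi in m]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


# Hypothetical mean RGB observation and hypothetical subspace direction.
x = [0.9, 1.2, 0.7]
m = [1.0, 1.0, 1.0]

x_hat = project_onto_line(x, m)                    # closest point in the subspace
residual = [xi - hi for xi, hi in zip(x, x_hat)]   # lies in the orthogonal complement
```

The residual is orthogonal to `m`, and any other point on the line spanned by `m` is farther from `x` than `x_hat`, which is the uniqueness statement of the theorem in this one-dimensional case.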
3.1 Basic Principles of Group Invariance
Consider a finite topological group
$$G = \{g_1, g_2, \dots, g_n\}$$
of distinct actions on a topological space $X$. A real valued function $f$ on $X$ is said to be invariant under $G$ if
$$f(g\,x) = f(x) \quad \forall\, g \in G,\; x \in X.$$
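The definition of invariance under a finite group can be made concrete with a toy example. In this sketch the group of cyclic channel shifts and the two test functions are assumptions chosen only to illustrate the definition; they are not the group or the features studied in the paper.

```python
def cyclic_shifts(n):
    """The finite group of cyclic shifts acting on length-n vectors."""
    return [lambda x, k=k: x[k:] + x[:k] for k in range(n)]


def is_invariant(f, group, samples, tol=1e-12):
    """True if f(g(x)) equals f(x) for every action g and every sample x."""
    return all(abs(f(g(x)) - f(x)) <= tol for g in group for x in samples)


def channel_sum(x):
    """Sum over all channels: unchanged by any reordering of the channels."""
    return sum(x)


def first_channel(x):
    """Value of the first channel: changes when the channels are shifted."""
    return x[0]


group = cyclic_shifts(3)
samples = [[0.2, 0.5, 0.3], [1.0, -2.0, 4.0]]
```

Here `is_invariant(channel_sum, group, samples)` holds while `is_invariant(first_channel, group, samples)` does not, so only the first function is invariant in the sense of the definition.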
Regarding a common optical sensor signal
[TABLE]
as spatial expectation over a skin operator and function of time
[TABLE]
we assume this multivariate observation is drawn from a normal distribution
[TABLE]
Local invariance of blood volume changes as function of time for each input feature under transformations of a differentiable local group of local transformations [35]
[TABLE]
can be approximately enforced by minimizing the regularizer
[TABLE]
For the covariance matrix of the observation with respect to the transformations
[TABLE]
and the corresponding symmetric eigenvalue problem
[TABLE]
we find an operator with corank for
[TABLE]
and the corresponding feature vector
[TABLE]
The observation is defining the null space of the projection operator
[TABLE]
3.2 The embedded Riemannian submanifold
In general, we are especially interested in the properties of the projection operator and the resulting linear subspace, since the projection direction is assumed to carry most of the PPGI motion signal component and the remaining subspace the desired quasi-periodic perfusion component. It is well known from the theory of Banach algebras that the spectral radius $\rho(A)$ of any $A \in \mathbb{C}^{n \times n}$ is given by Gelfand’s formula
$$\rho(A) = \lim_{k \to \infty} \lVert A^k \rVert^{1/k}$$
for any matrix norm $\lVert \cdot \rVert$ on $\mathbb{C}^{n \times n}$. For a matrix, the spectrum is just the collection of eigenvalues $\lambda_i$ with $\rho(A) = \max_i \lvert \lambda_i \rvert$. For the projection operator, where the removed direction is represented by the eigenvector with the largest eigenvalue, this implies a reduction of the spectral radius of the observation
[TABLE]
The projection removes the direction of the largest variance and puts more emphasis on the directions which vary less. It should be clear that the projection is an explicit operator which has to be estimated from an observation. A more convenient way of incorporating invariance into the feature space would be to define an implicit operator.
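The explicit projection described in this paragraph can be sketched as follows. This is an illustrative stand-in, not the toolbox code: the synthetic trace, the `motion` and `pulse` directions and their amplitudes are assumptions, and power iteration is used here only as a simple way to obtain the eigenvector with the largest eigenvalue.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equally long feature vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
             for j in range(d)] for i in range(d)]


def dominant_eigenvector(c, iters=200):
    """Power iteration: unit eigenvector belonging to the largest eigenvalue."""
    v = [1.0] * len(c)
    for _ in range(iters):
        w = [sum(cij * vj for cij, vj in zip(row, v)) for row in c]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v


def project_out(rows, v):
    """Apply P = I - v v^T to every feature vector."""
    out = []
    for r in rows:
        along = sum(ri * vi for ri, vi in zip(r, v))
        out.append([ri - along * vi for ri, vi in zip(r, v)])
    return out


# Synthetic RGB trace: strong "motion" along one direction, weak "pulse" orthogonal to it.
s3, s2 = math.sqrt(3.0), math.sqrt(2.0)
motion = [1 / s3, 1 / s3, 1 / s3]
pulse = [0.0, 1 / s2, -1 / s2]
rows = [[5.0 * math.sin(0.3 * t) * m + 0.1 * math.sin(1.1 * t) * p
         for m, p in zip(motion, pulse)] for t in range(200)]

v = dominant_eigenvector(covariance(rows))   # estimated from the observation itself
cleaned = project_out(rows, v)               # variance along v is removed
```

The sketch makes the drawback visible: `v` has to be estimated from a block of observations before the projection can be applied, which is what motivates the implicit operator.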
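Let $\lambda$ be an eigenvalue of a matrix $A$, and let $x \neq 0$ be a corresponding eigenvector. From $Ax = \lambda x$, we have $AX = \lambda X$ where $X = [x, \dots, x] \in \mathbb{C}^{n \times n} \setminus \{0\}$. It follows $\lvert \lambda \rvert \, \lVert X \rVert = \lVert \lambda X \rVert = \lVert A X \rVert \leq \lVert A \rVert \, \lVert X \rVert$, and simplifying by $\lVert X \rVert > 0$ gives $\lvert \lambda \rvert \leq \lVert A \rVert$. Taking the maximum over all eigenvalues results in $\rho(A) \leq \lVert A \rVert$ for the spectral radius $\rho(A)$.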
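The bound of the spectral radius by a matrix norm can be verified on a small case. This sketch uses a symmetric 2x2 matrix because its eigenvalues have a closed form; the matrix entries are arbitrary illustration values.

```python
import math


def symmetric_2x2_eigenvalues(a, b, c):
    """Eigenvalues of [[a, b], [b, c]] in closed form."""
    centre = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return centre - radius, centre + radius


def spectral_radius(a, b, c):
    """Largest absolute eigenvalue of [[a, b], [b, c]]."""
    return max(abs(lam) for lam in symmetric_2x2_eigenvalues(a, b, c))


def frobenius_norm(a, b, c):
    """Frobenius norm of [[a, b], [b, c]]."""
    return math.sqrt(a * a + 2.0 * b * b + c * c)


def max_row_sum_norm(a, b, c):
    """Induced infinity norm (maximum absolute row sum) of [[a, b], [b, c]]."""
    return max(abs(a) + abs(b), abs(b) + abs(c))


rho = spectral_radius(2.0, 1.0, 3.0)
```

For this matrix the spectral radius stays below both the Frobenius norm and the maximum row sum norm, as the inequality requires for any matrix norm.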
Now, regarding the optical sensor signal relative to the spectral radius of its elements results in
[TABLE]
with
[TABLE]
and distributed on the unit sphere as embedded Riemannian submanifold of the Euclidean space .
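A minimal sketch of mapping an RGB observation onto the unit sphere is given below. It assumes that the normalization is the Euclidean norm of the three channel values, which the surviving text does not confirm; the frame values are hypothetical.

```python
import math


def to_unit_sphere(rgb):
    """Scale an RGB vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(c * c for c in rgb))
    return [c / norm for c in rgb]


def geodesic_distance(u, v):
    """Great-circle distance between two unit vectors."""
    cosine = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp against rounding


frame_a = [92.0, 120.0, 71.0]            # hypothetical mean skin RGB of one frame
frame_b = [1.3 * c for c in frame_a]     # same colour direction, brighter frame
frame_c = [95.0, 118.0, 73.0]            # slightly different colour direction

u_a, u_b, u_c = (to_unit_sphere(f) for f in (frame_a, frame_b, frame_c))
```

Under this assumption a pure change of overall intensity (`frame_b`) maps to the same point on the sphere as `frame_a`, while a change of colour direction (`frame_c`) produces a nonzero geodesic distance between successive measurements.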
Intuitively, for real valued observations in Euclidean space the mean is given by the arithmetic average $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. For more general settings, however, no such closed form solution is available, and the mean has to be found by solving the optimization problem
[TABLE]
by a gradient descent algorithm. The resulting minimizer is called the Riemannian center of mass or Karcher mean [36]. The optimality condition is given by
[TABLE]
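The gradient descent for the Karcher mean on the unit sphere can be sketched as follows. This is a generic textbook construction using the logarithm and exponential maps of the sphere with unit step size, not the reference implementation; the sample points are hypothetical and assumed to be clustered, so antipodal points are not handled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def log_map(p, q):
    """Tangent vector at p pointing towards q, with length equal to the geodesic distance."""
    c = max(-1.0, min(1.0, dot(p, q)))
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0] * len(p)
    u = [qi - c * pi for pi, qi in zip(p, q)]   # component of q orthogonal to p
    un = math.sqrt(dot(u, u))
    return [theta * ui / un for ui in u]


def exp_map(p, v):
    """Walk from p along the geodesic with initial velocity v."""
    n = math.sqrt(dot(v, v))
    if n < 1e-12:
        return list(p)
    return [math.cos(n) * pi + math.sin(n) * vi / n for pi, vi in zip(p, v)]


def mean_log(mu, points):
    """Average of the log maps at mu; it vanishes at the Karcher mean."""
    logs = [log_map(mu, q) for q in points]
    return [sum(c) / len(points) for c in zip(*logs)]


def karcher_mean(points, iters=100, tol=1e-12):
    mu = normalize([sum(c) for c in zip(*points)])   # start at the projected arithmetic mean
    for _ in range(iters):
        step = mean_log(mu, points)
        if math.sqrt(dot(step, step)) < tol:
            break
        mu = exp_map(mu, step)
    return mu


points = [normalize(p) for p in ([1.0, 0.2, 0.1], [0.8, 0.4, 0.3], [0.9, 0.1, 0.5])]
mu = karcher_mean(points)
```

At the returned point the averaged log map is numerically zero, which is the optimality condition stated above expressed in the tangent space of the sphere.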
Considering the embedded dimensions of the unit sphere given by its spherical coordinates
[TABLE]
[TABLE]
[TABLE]
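the directional statistics are represented by the von Mises-Fisher distribution
$$f_p(x; \mu, \kappa) = C_p(\kappa) \exp\left(\kappa\, \mu^{\top} x\right)$$
for a unit vector $x \in \mathbb{R}^p$ with mean direction $\mu$, concentration parameter $\kappa$ and the normalization constant given by
$$C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)}$$
where $I_{\nu}$ denotes the modified Bessel function of the first kind of order $\nu$. The spherical direction evolves according to the principles of the stochastic differential equation given by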
[TABLE]
whereby can be expressed as part of a Wiener process given by
[TABLE]
For a detailed discussion of the Diffusion process see the previous work of Pilz et al. [5].
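For the unit sphere in three dimensions the von Mises-Fisher density named in this section has an elementary form, because the normalization constant reduces to kappa / (4 pi sinh(kappa)). The following sketch evaluates that density and checks numerically that it integrates to one; the mean direction and concentration are hypothetical values.

```python
import math


def vmf_pdf_s2(x, mu, kappa):
    """von Mises-Fisher density on the unit sphere in R^3 (p = 3)."""
    norm_const = kappa / (4.0 * math.pi * math.sinh(kappa))
    return norm_const * math.exp(kappa * sum(a * b for a, b in zip(mu, x)))


def integrate_over_sphere(f, n_polar=200, n_azimuth=200):
    """Midpoint rule in spherical coordinates with area element sin(polar)."""
    d_polar = math.pi / n_polar
    d_azimuth = 2.0 * math.pi / n_azimuth
    total = 0.0
    for i in range(n_polar):
        polar = (i + 0.5) * d_polar
        for j in range(n_azimuth):
            azimuth = (j + 0.5) * d_azimuth
            x = [math.sin(polar) * math.cos(azimuth),
                 math.sin(polar) * math.sin(azimuth),
                 math.cos(polar)]
            total += f(x) * math.sin(polar) * d_polar * d_azimuth
    return total


mu = [0.0, 0.0, 1.0]   # hypothetical mean direction
kappa = 5.0            # hypothetical concentration
```

A larger `kappa` concentrates the probability mass more tightly around `mu`, so the density at `mu` grows while the density at the antipode shrinks.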
4 Experiments
The described feature operator was evaluated against different methods on two publicly available databases. We decided to compare the operator against the baseline green channel expectation [19, 20], the Spatial Subspace Rotation (SSR) [31], the Projection Orthogonal to Skin (POS) [34] and the Local Group Invariance (LGI) [6] method. All experiments were executed on the French UBFC-RPPG [37] and the German LGI Multi-Session face video database [6]. Figures 3(a) and 3(b) illustrate some representative images of these two databases. The UBFC and the LGI database were created using a custom C++ application for video acquisition with a simple low cost Logitech webcam and a CMS50E transmissive pulse oximeter to obtain the ground truth PPG data comprising the PPG waveform. The total number of face video recordings amounts to 150 sequences, each a couple of minutes long. During the recordings, the subjects of the UBFC trials performed moderate face and head motions in indoor environments, whereas the LGI recordings contain scenarios ranging from resting and head rotation through sport activities to natural outdoor conversations. The evaluation concept therefore ranges from cooperative to challenging scenarios, which should become noticeable in the prediction accuracy results.
The primary signal processing procedure was chosen to be identical for every approach and database. For each video frame a common Viola-Jones face finder was used to pre-select the region of interest. A simple skin operator was applied to the region by thresholding the blue- and red-difference chroma components. For the set of obtained RGB pixels the approach specific operators were computed and stored as time series for further spectral processing and interpretation. Each signal obtained by the different algorithms was band-pass filtered in the range between 0.5 and 2.5 Hz. All filtered signals were then analyzed by a standard Fourier based spectral method with a window size of 256 samples and an overlap of 90 percent. A maximum peak energy criterion was applied over the spectral traces to determine the heart rate. All PPG ground truth signals were analyzed in the same way but were first resampled to the camera frame rate. For each user and algorithm, correlation coefficients against the PPG reference heart rate were computed together with the root-mean-square error (RMSE). We did not perform a signal-to-noise ratio (SNR) comparison as proposed by De Haan and Jeanne [29] and often used by several other authors [30, 34, 7]. This might be useful on short video sequences with a more or less stationary frequency behavior of blood volume changes. A prospective SNR metric for system evaluation should at least include information about its variance computed on short term spectra.
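The spectral part of this procedure can be sketched as follows. The window size of 256 samples, the 90 percent overlap and the 0.5 to 2.5 Hz band are taken from the text; everything else is an assumption for illustration: the frame rate of 25 Hz, the Hann taper, and the fact that the band-pass filter is replaced here by simply restricting the peak search to the bins inside the band.

```python
import cmath
import math


def band_peak_bpm(window, fs, f_lo=0.5, f_hi=2.5):
    """Heart rate in beats per minute from the strongest DFT bin inside the band."""
    n = len(window)
    mean = sum(window) / n
    # Remove the mean and apply a Hann taper to limit spectral leakage.
    tapered = [(x - mean) * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
               for i, x in enumerate(window)]
    best_k, best_power = None, -1.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if freq < f_lo or freq > f_hi:
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(tapered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        raise ValueError("no DFT bin falls inside the requested band")
    return 60.0 * best_k * fs / n


def heart_rate_trace(signal, fs, win=256, overlap=0.9):
    """One heart rate estimate per sliding window."""
    hop = max(1, win - int(overlap * win))
    return [band_peak_bpm(signal[s:s + win], fs)
            for s in range(0, len(signal) - win + 1, hop)]


fs = 25.0  # assumed camera frame rate in Hz
# Synthetic feature trace: 1.2 Hz pulse (72 bpm) plus a stronger 0.1 Hz drift.
signal = [math.sin(2.0 * math.pi * 1.2 * i / fs)
          + 3.0 * math.sin(2.0 * math.pi * 0.1 * i / fs) for i in range(512)]
rates = heart_rate_trace(signal, fs)
```

With these settings the frequency resolution is `fs / 256`, roughly 0.1 Hz or about 6 beats per minute, so each estimate can only be as accurate as one DFT bin.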
Table 1 gives an overview of the heart rate prediction accuracy of the different operators on the different data sets. Compared to the baseline green channel, SSR and POS approaches, the LGI and the proposed spherical operator (SPH) are more robust, especially under motion scenarios. Although the SPH operator cannot outperform the LGI approach in most cases, its accuracy lies in a very similar range. However, in fully uncontrolled scenarios with changing illumination as well as different head and face motion, as given in the LGI city talk session, no algorithm is able to perform reasonably well.
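The two accuracy figures reported per operator and data set can be computed as in the sketch below. It assumes the Pearson correlation coefficient, which the text does not name explicitly, and the heart rate values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(estimate, reference):
    """Root-mean-square error between estimated and reference heart rates."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference))
                     / len(reference))


# Hypothetical heart rate traces in beats per minute.
reference = [70.0, 72.0, 75.0, 74.0, 71.0]
estimate = [71.0, 72.0, 77.0, 73.0, 70.0]

accuracy = (pearson(estimate, reference), rmse(estimate, reference))
```

Reporting both values is useful because a trace can follow the reference closely in shape (high correlation) while still carrying a constant offset that only the RMSE reveals.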
Figures 4 and 5 visualize the box plot statistics for the UBFC and the LGI database. In addition to Table 1, they show the influence of the Diffusion process incorporated into the LGI and SPH approaches. Both approaches benefit from its stochastic interpretation of the quasi-periodic nature of blood volume changes.
5 Discussion
We have extended the current knowledge on linear orthogonal operators in the PPGI feature space. The resulting manifold valued representation holds implicit invariance properties with respect to translations of the Group acting on the set of features. This carries major advantages over the previous prior based assumptions of both POS and LGI, since it comes with a simple change of the topology in which the existence of these properties is guaranteed by the fundamental attributes of the space. The computational complexity of the new feature operator is very low and its implementation is fairly easy. In the comparison against the most popular representatives of this algorithmic family it achieved promising prediction accuracy. Since the approach reflects a fully closed form solution of the genuine problem statement, no tedious parameter tuning is necessary; it operates free of parameters. The major limitation of the operator in its current form is the restriction of invariance to the group of translations. Since we have presented the proof that the intensity of pixels does not contribute to the periodic characteristics of skin perfusion, it follows that it is indeed a question of the wavelength.
References
- [1] N. Wiener. Cybernetics: Or control and communication in the animal and the machine. MIT Press, 2(1), 1948.
- [2] X. Li, J. Chen, G. Zhao, and M. Pietikäinen. Remote heart rate measurement from face videos under realistic situations. IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014.
- [3] A. Osman, J. Turcot, and R. E. Kaliouby. Supervised learning approach to remote heart rate estimation from facial videos. 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–6, 2015.
- [4] S. Tulyakov, X. A. Pineda, E. Ricci, L. Yin, J. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
- [5] C. S. Pilz, J. Krajewski, and V. Blazek. On the diffusion process for heart rate estimation from face videos under realistic conditions. Pattern Recognition. GCPR 2017. Lecture Notes in Computer Science, vol 10496. Springer, 10496:361–373, 2017.
- [6] C. S. Pilz, S. Zaunseder, J. Krajewski, and V. Blazek. Local group invariance for heart rate estimation from face videos in the wild. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1254–1262, 2018.
- [7] W. Chen and D. McDuff. Deepphys: Video-based physiological measurement using convolutional attention networks. European Conference on Computer Vision (ECCV), pages 356–373, 2018.
- [8] C.R. Blazek and M. Hülsbusch. Assessment of allergic skin reactions and their hemodynamical quantification using photoplethysmography imaging. Computer-aided Noninvasive Vascular Diagnostics. Vol. 3: Proc. of 11th Int. Symposium CNVD, 3:85–90, 2005.
