REsCUE: A framework for REal-time feedback on behavioral CUEs using multimodal anomaly detection

Riku Arakawa; Hiromu Yakura

arXiv:1903.11485·cs.HC·March 28, 2019

TL;DR

REsCUE is an unsupervised multimodal anomaly detection system that provides real-time feedback on unconscious behavioral cues to assist executive coaching and potentially other applications.

Contribution

The paper introduces REsCUE, a novel framework that uses unsupervised anomaly detection on multimodal data to identify unconscious behaviors in real-time.

Findings

01. REsCUE effectively detects behavioral cues in coaching scenarios.

02. The system provides intuitive real-time feedback to coaches.

03. REsCUE's unsupervised approach requires no prior knowledge.

Tables

Table 1. The results of the preliminary experiment. The combination of the posture and gaze information showed the highest detection performance.

Posture	Gaze	Facial expression	Recall	Average of $\tau$ distance
✓			$0.57 \pm 0.08$	$0.42 \pm 0.14$
	✓		$0.38 \pm 0.15$	$0.21 \pm 0.12$
		✓	$0.08 \pm 0.10$	$0.01 \pm 0.02$
✓	✓		$0.68 \pm 0.08$	$0.64 \pm 0.09$
✓		✓	$0.57 \pm 0.12$	$0.40 \pm 0.16$
	✓	✓	$0.37 \pm 0.12$	$0.20 \pm 0.12$
✓	✓	✓	$0.67 \pm 0.16$	$0.61 \pm 0.12$

Equations

$$a^{(t)}=-\ln\sum_{i=0}^{l}c_{i}^{(t-1)}\,\mathcal{N}\left(\bm{x}^{(t)}\mid\bm{\mu}_{i}^{(t-1)},\bm{\Sigma}_{i}^{(t-1)}\right)$$

$\gamma_{i}^{(t)}$

$c_{i}^{(t)}$

$\bar{\bm{\mu}}_{i}^{(t)}$

$\bm{\mu}_{i}^{(t)}$

$\bar{\bm{\Sigma}}_{i}^{(t)}$

$\bm{\Sigma}_{i}^{(t)}$

$\hat{\bm{X}}_{i}^{(t)}=$

$\check{\bm{X}}^{(t)}=$

$$K_{min}\left(\tau_{1},\tau_{2}\right)=\frac{\sum_{\{i,j\}\in P\left(\tau_{1},\tau_{2}\right)}\bar{K}_{i,j}\left(\tau_{1},\tau_{2}\right)}{k\times(k-1)/2}$$

Full text

Riku Arakawa

University of Tokyo, Tokyo, Japan

and

Hiromu Yakura

Teambox Inc., Tokyo, Japan

(2019)

Abstract.

Executive coaching has been drawing more and more attention for developing corporate managers. While conversing with managers, coach practitioners are also required to understand internal states of coachees through objective observations. In this paper, we present REsCUE, an automated system to aid coach practitioners in detecting unconscious behaviors of their clients. Using an unsupervised anomaly detection algorithm applied to multimodal behavior data such as the subject’s posture and gaze, REsCUE notifies behavioral cues for coaches via intuitive and interpretive feedback in real-time. Our evaluation with actual coaching scenes confirms that REsCUE provides the informative cues to understand internal states of coachees. Since REsCUE is based on the unsupervised method and does not assume any prior knowledge, further applications beside executive coaching are conceivable using our framework.

Executive Coaching, Nonverbal behavior analysis, Multimodal interaction, Anomaly detection

CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), May 4–9, 2019, Glasgow, Scotland UK. DOI: 10.1145/3290605.3300802. ISBN: 978-1-4503-5970-2/19/05.

CCS concepts: Human-centered computing → Computer supported cooperative work; Human-centered computing → HCI design and evaluation methods; Information systems → Multimedia and multimodal retrieval.

1. Introduction

Executive coaching plays an important role in human resource development (Kilburg, 2000; Feldman and Lankau, 2005). As a result, many companies invest in executive coaching to improve the leadership skills or the performance of their managers, and the executive coaching market has grown to $2 billion (Hamlin et al., 2008; Fillery-Travis and Lane, 2006; Athanasopoulou and Dopson, 2018).

Executive coaching usually consists of personal, one-on-one sessions (Koonce, 1994; Stern, 2004). One-on-one sessions are preferred because coaches are required not only to build a rapport with a coachee but also to observe the nonverbal behavior of the coachee during the coaching session (Kowalski and Casper, 2007; Bloom et al., 2005). For example, the use of disorienting dilemmas is one of the important coaching processes (Cox, 2015); however, sensitive conversations about a coachee’s dilemmas may cause deceptive responses (Lippard, 1988).

In such a situation, coaches are expected to notice a discrepancy between the verbal response and the actual thoughts of the coachee using nonverbal cues (Goman, 2008). Many articles list the observation skills as one of the skills required for effective coaching, in addition to emotional intelligence and questioning techniques (Grant, 2007; Ellinger et al., 2003).

However, maintaining such objective observations throughout the coaching session requires a great deal of skill (Riggio and Lee, 2007). Coaches are often immersed in the verbal communication, paying attention to a deeper or emotional topic or thinking about what to ask next. In addition, self-deception may interfere with the quality of perception, e.g., ignoring significant behavior unconsciously based on faulty thinking or irrational beliefs (Bachkirova, 2014). Therefore, we expect that the quality of coaching could be improved if the coaches are automatically notified of important nonverbal behavioral cues from the coachee independently of their subjectivity or mental load.

One possible solution is to apply conventional methods proposed in the context of human activity analysis (Aggarwal and Ryoo, 2011) or social signal processing (Vinciarelli et al., 2012). However, these methods are mainly targeted at classifying human activities into specific categories and therefore have low affinity to the current situation, i.e., each behavior may correspond to various meanings depending on its context in the coaching session (Brown and Moskowitz, 1998; Ianiro and Kauffeld, 2014). In addition, most of these methods are designed for post-analysis and are not applicable for providing real-time feedback during a session. We, therefore, propose REsCUE, a new system introducing a real-time anomaly detection method into human behavior analysis (Figure 1).

Our framework, by exploiting the anomaly detection method, does not require prior knowledge or heuristic rules and therefore leaves room for the coach’s interpretation of the semantics of the behavior based on the context. Moreover, combined with state-of-the-art behavior analysis methods, our framework is able to detect small but important behavioral cues, which might be missed by the coaches. REsCUE presents a new perspective of human behavior analysis that augments the perception of the user while leaving its interpretation to the user, which paves the way for further applications in the HCI community.

1.1. Contribution

The following four points are the main contributions of this study.

(1) We developed an intelligent system for use in coaching sessions that can automatically detect the nonverbal behavioral cues of coachees and provide feedback to coaches in real-time.

(2) Based on a preliminary analysis, we confirmed that the combination of the posture and gaze information is an effective modality for detecting nonverbal cues of coachees.

(3) Our user study in actual coaching scenes demonstrates that the proposed system provides informative feedback to professional coaches and would likely improve the quality of sessions.

(4) Because the proposed method is based on an unsupervised algorithm and does not assume prior knowledge, it can be applied widely outside of executive coaching applications, as a new framework for real-time behavioral analysis.

2. Literature Review

2.1. Background

The relation between nonverbal behaviors and internal states has been a prominent topic in the history of science (Aron W. Siegman, 1978; Harper et al., 1978). Beginning with Charles Darwin (Darwin, 1872), many researchers have pointed out that nonverbal behaviors are spontaneous and unregulated expressions of internal states (DePaulo, 1992). Conversely, the effect of nonverbal behaviors on the internal states has also been revealed, e.g., the influence of facial muscular activity on people’s affective responses (Carr et al., 2003).

Based on this relationship, the observation of nonverbal behaviors has received considerable attention in various areas including executive coaching, as discussed in the “Introduction” section. For example, teachers are encouraged to pay attention to nonverbal behaviors of students, which convey their underlying feelings (King, 1999). In addition, not only teachers or therapists (Kinseth, 1989) but also salespeople (Rentz et al., 2002) or entrepreneurs (Peleckis et al., 2016) are expected to get a handle on nonverbal behavioral cues. Subsequently, a research domain of automatically analyzing nonverbal behaviors, which is often referred to as social signal processing (Vinciarelli et al., 2012), has emerged.

2.2. Related Work

Many methods have been proposed to analyze human nonverbal behavior using various modalities for one-on-one sessions or group discussions in the context of both human activity analysis (Aggarwal and Ryoo, 2011) and social signal processing (Vinciarelli et al., 2012). For example, conventional methods relying on handcrafted features have been widely researched, e.g., facial expression recognition (Lyons et al., 1998; Shan et al., 2009) and posture estimation (Cucchiara et al., 2005). Conversely, due to the development of deep learning in recent years, end-to-end methods using neural networks have become popular and have shown overwhelming performance improvements. For example, Wei et al. (Wei et al., 2016) achieved state-of-the-art performance in posture estimation by introducing a convolutional neural network.

Based on such analysis technologies, many applications have been proposed (Vinciarelli et al., 2012). For example, Sanchez-Cortes et al. (Sanchez-Cortes et al., 2012) proposed a method to detect emergent leaders in a group discussion using handcrafted audio and visual features. Beyan et al. (Beyan et al., 2018) applied multiple kernel learning to similar features to predict the leadership styles of emergent leaders. Hoque et al. (Hoque et al., 2013) leveraged multimodal behavioral data to generate instructive feedback in the context of training for job interviews. Nihei et al. (Nihei et al., 2017) introduced a convolutional neural network to extract important utterances from multimodal behavioral data without relying on handcrafted features. However, these methods are designed to analyze sessions after they occur and are not formulated to provide feedback to coaches in real-time.

Based on the methods to understand human nonverbal behavior in real-time, some studies have proposed systems to provide real-time feedback on social interactions (Kurihara et al., 2007; Nguyen et al., 2012; Tausczik and Pennebaker, 2013; Tanveer et al., 2015; Damian et al., 2015; Schneider et al., 2015; Muralidhar et al., 2016). For example, Rhema (Tanveer et al., 2015) is designed to help people with public presentations by providing feedback in real-time via Google Glass based on a speaker’s volume and speaking rate. Logue (Damian et al., 2015) addressed a similar situation by providing feedback via head-mounted display based on body energy and openness calculated from hand positions. For group discussions, Tausczik et al. (Tausczik and Pennebaker, 2013) proposed a system to analyze the communication patterns of participants and to provide linguistic feedback for improving teamwork. In addition, Damian et al. (Damian et al., 2016) proposed a general framework to provide real-time feedback on predefined behavioral events represented in an XML format.

These methods are designed to provide explicit feedback based on some specific rules, such as “louder” if the speaking voice is faint or “pay attention to what others are saying” if the group dynamics are poor. However, as mentioned in the “Introduction” section, the meaning of the behavior is largely dependent on the context in a coaching session and therefore such explicit feedback would be impossible.

A similar discussion was presented in a proposal of AutoManner (Tanveer et al., 2016), a system to improve body languages in public speaking, that is “the appropriateness of the body language is largely dependent on the context of the speech—which is difficult to automatically assess.” The system overcame this problem by displaying the estimated body skeleton with its changes and distributions in the time series and leaving room for interpretation by the speaker. However, it is specifically designed for public speaking and therefore not directly applicable to coaching. Moreover, it assumes the use of post-analysis by speakers themselves and therefore is not able to provide real-time feedback.

3. Proposed Method

In this section, we first describe the requirements of the proposed method. Then, the technical details of the method are presented, including how these requirements are met.

3.1. Requirements

To make the coaching sessions more effective using behavior analysis, the following requirements should be considered:

(1) Unsupervised detection

As discussed in the “Introduction” section, coaches are required to maintain objective and unbiased observations of the coachees. That is, assessing the behavior of the coachees based on heuristic rules or human-annotated training data introduces certain criteria and is not appropriate. In addition, due to the dependency of the meaning of nonverbal behaviors on their context, designing effective rules or collecting reliable training data is unrealistic. Therefore, we need the proposed system to detect the behavioral cues using an unsupervised algorithm.

(2) Real-time feedback

We aim to provide cues for coaches to understand the state of coachees to improve the quality of coaching sessions. Therefore, we need the proposed system to detect the cues and provide feedback in real-time, not via post-analysis of a session.

(3) Intuitive and interpretive feedback

We assume that coaches will use the proposed system while conversing with coachees. Therefore, the feedback presented to the coaches needs to be intuitive so that they can interpret it at a glance. At the same time, feedback that is too abstract, such as presenting only the fact that the behavior has changed or the statistical value of how it has changed, loses its context and is difficult for the coaches to interpret even though it would take a short time to understand. Therefore, we need the proposed system to preserve both intuitiveness and interpretiveness with regard to the feedback.

(4) Non-interruptive notifications

Similar to requirement (3), we need to consider how to notify the coaches of the feedback. If the notification causes an interruption, e.g., requesting an action by the coach every time, the quality of the coaching session may degrade. That is, we need the proposed system to notify in a non-interruptive manner.

(5) Portable and non-interfering devices

Because coaching sessions are often held in the office of the coachee, we need the proposed system to consist of portable devices so that the coach can easily transport them. In addition, if the proposed system requires the coachee to wear devices or sensors, this may interfere with their concentration and result in an obstacle to building a rapport. Therefore, we also need the system to consist of non-interfering devices.

3.2. Overview

To address the above requirements, we designed the proposed system as shown in Figure 2. The system collects the behavioral data of the coachee via external devices and obtains multimodal feature data. Using an anomaly detection algorithm, it detects important behavioral cues and notifies the coach both visually and tactilely in real-time.

We now describe the technical specifications of the proposed system along with the rationale behind each specification.

3.3. Multimodal Input

The proposed system obtains the multimodal behavioral data from the coachees to detect their behavioral cues.

As mentioned in the “Related Work” section, various types of multimodal features have been used in automated behavior analysis. For example, Nihei et al. (Nihei et al., 2017) leveraged head pose information and speech features and concluded that the combination of these features achieved the best accuracy for detecting important utterances compared to unimodal methods. Beyan et al. (Beyan et al., 2018) combined the pose and speech features with the gaze information and reported the effectiveness of the multimodal input. Hoque et al. (Hoque et al., 2013) exploited facial expressions for the training of interviewees.

Based on both these studies and the coaching skills mentioned in the “Introduction” section, we prepared three input features: body posture (including head pose), gaze direction, and facial expression. Specifically, the body language captured from the body posture is the basis of the observation skill described in (Grant, 2007; Ellinger et al., 2003), and the importance of interpreting the internal state from both the facial expression and the eyes of the coachee is emphasized in (Bloom et al., 2005). We then selected which of these modalities to use in the proposed system based on a preliminary experiment.

Here, we excluded the speech features due to requirement (3) because it is difficult to construct intuitive and interpretive feedback from changes in speech features. While presenting the frequency or the decibel level of the coachee’s voice is possible, it is difficult to understand how the behavior of the coachee changed and to infer their internal state from such feedback.

In addition, it is possible to use biometric devices to directly capture the signals from the coachee’s body. However, this contradicts requirement (5) and could create a psychological barrier. Therefore, in this study, we limited the input modalities to those that are measurable without a body-mounted sensor.

3.3.1. Posture

The proposed system uses the pose estimation algorithm proposed in (Wei et al., 2016). Because the algorithm is able to estimate the pose from images taken by a web camera, motion-tracking devices, which contradict requirement (5), are not required. In addition, the algorithm can not only achieve the state-of-the-art performance, as mentioned in the “Related Work” section, but can also be processed quickly enough to provide real-time feedback, satisfying requirement (2).

In the proposed system, the 12 key points shown in Figure 3 are used to detect the behavioral cues. We excluded the key points in the lower body because the coaching sessions are usually held at a desk.

3.3.2. Gaze

The proposed system uses a commercial eye tracker, the Tobii 4C, to detect the gaze direction. This is a USB-connected device that is capable of tracking the gaze direction without being worn, satisfying requirement (5).

In the proposed system, we used the two-dimensional coordinates of the gaze position. Because the proposed system focuses on the relative change in the detected value, not the absolute value, it does not require a calibration step for each session.

3.3.3. Facial Expression

The proposed system uses MicroExpNet (Çugu et al., 2017) to extract the facial expressions of the coachees. This is a small and fast convolutional neural network designed for facial expression recognition, which is obtained by distilling a heavy and accurate neural network. Çugu et al. (Çugu et al., 2017) reported that the network achieved over 95.0% classification accuracy for the eight expressions of “neutral,” “anger,” “contempt,” “disgust,” “fear,” “happy,” “sadness,” and “surprise” under real-time conditions. Therefore, we decided to use this network to meet requirement (2).

In the proposed system, we use the output value of the final layer after the softmax activation as an eight-dimensional feature vector of the facial expression in the same manner as (Vo and Le, 2016). In other words, each value of the vector represents the probability that the expression belongs to the corresponding class.
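As an illustration of how the three modalities described above could be combined into a single frame-level feature vector, the following minimal sketch concatenates 12 key points, a gaze position, and an eight-class softmax output. It assumes that each key point is a two-dimensional image coordinate and that the network exposes pre-softmax scores; the function names and the ordering of the features are hypothetical and not taken from the original implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def build_feature_vector(keypoints_xy, gaze_xy, expression_logits):
    """Concatenate the three modalities into one frame-level vector.

    keypoints_xy      : (12, 2) upper-body key points (assumed 2-D coordinates)
    gaze_xy           : (2,) gaze position
    expression_logits : (8,) hypothetical pre-softmax scores for the 8 classes
    """
    posture = np.asarray(keypoints_xy, dtype=float).reshape(-1)   # 24 values
    gaze = np.asarray(gaze_xy, dtype=float).reshape(-1)           # 2 values
    expression = softmax(expression_logits)                       # 8 probabilities
    return np.concatenate([posture, gaze, expression])

# Dummy inputs, only to show the resulting dimensionality (24 + 2 + 8).
frame = build_feature_vector(np.zeros((12, 2)), [0.4, 0.6], np.arange(8.0))
```

Dropping a modality then amounts to omitting the corresponding slice, which is how the modality combinations in the preliminary experiment could be compared under this layout.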

3.4. Anomaly Detection

The proposed system detects behavioral cues from the multimodal feature data obtained by the method described in the “Multimodal Input” section. To satisfy requirements (1) and (2), we used anomaly detection algorithms, which are sometimes referred to as change point detection algorithms, because many unsupervised online algorithms have been proposed for anomaly detection (Chandola et al., 2009).

We use SmartSifter (Yamanishi et al., 2004), an adaptive anomaly detection algorithm based on the Gaussian Mixture Model (GMM). The main reason we chose the algorithm is that it is one of the most popular unsupervised online anomaly detection algorithms available. In addition, the results obtained with this GMM-based approach are useful for designing informative feedback, as described later in this section.

Each time new input data arrive, SmartSifter estimates the data’s outlierness based on the likelihood calculated by the GMM and, at the same time, updates the parameters of the GMM to fit the input data. More formally, let $\bm{x}^{(t)}$ be an input datum and $c_{i}^{(t)}$, $\bm{\mu}_{i}^{(t)}$, and $\bm{\Sigma}_{i}^{(t)}$ be the weight, mean, and covariance of the $i$-th component of the $l$-component GMM, respectively, at time $t$. Then, the outlierness of $\bm{x}^{(t)}$ is calculated as follows.

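$$a^{(t)}=-\ln\sum_{i=0}^{l}c_{i}^{(t-1)}\,\mathcal{N}\left(\bm{x}^{(t)}\mid\bm{\mu}_{i}^{(t-1)},\bm{\Sigma}_{i}^{(t-1)}\right)\qquad(1)$$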

The parameters of the GMM are updated subsequently as follows.

[TABLE]

Here, $r$ represents a forgetting rate, which is related to the degree of discounting of past input data.
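To make the scoring and discounting steps concrete, the following is a minimal sketch of an online GMM with a forgetting rate $r$. The outlierness is the negative log-likelihood under the parameters from the previous step. The update follows the discounting scheme of SmartSifter's SDEM algorithm as it is commonly described, and its exact form should be checked against Yamanishi et al. (2004). The class name, the stabilizing constant `alpha`, the random initialization, and the small ridge term `eps` are assumptions made for this illustration.

```python
import numpy as np

class OnlineGMM:
    """Online Gaussian mixture with exponential forgetting (SDEM-style sketch)."""

    def __init__(self, n_components, dim, r=0.1, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.l, self.r, self.alpha = n_components, r, alpha
        self.c = np.full(n_components, 1.0 / n_components)     # weights c_i
        self.mu = rng.normal(size=(n_components, dim))         # means mu_i
        self.sigma = np.stack([np.eye(dim)] * n_components)    # covariances Sigma_i
        # Discounted sufficient statistics (the "barred" quantities).
        self.mu_bar = self.c[:, None] * self.mu
        self.sigma_bar = self.c[:, None, None] * (
            self.sigma + np.einsum("id,ie->ide", self.mu, self.mu))

    def _log_weighted(self, x):
        """Return ln(c_i * N(x | mu_i, Sigma_i)) for every component."""
        d = x.shape[0]
        out = np.empty(self.l)
        for i in range(self.l):
            diff = x - self.mu[i]
            _, logdet = np.linalg.slogdet(self.sigma[i])
            maha = diff @ np.linalg.solve(self.sigma[i], diff)
            out[i] = np.log(self.c[i]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        return out

    def score(self, x):
        """Outlierness: negative log mixture likelihood under current parameters."""
        lw = self._log_weighted(x)
        m = lw.max()
        return float(-(m + np.log(np.exp(lw - m).sum())))

    def update(self, x, eps=1e-6):
        """Discount old statistics by (1 - r) and blend in the new observation."""
        lw = self._log_weighted(x)
        post = np.exp(lw - lw.max())
        post /= post.sum()
        gamma = (1 - self.alpha * self.r) * post + self.alpha * self.r / self.l
        self.c = (1 - self.r) * self.c + self.r * gamma
        self.mu_bar = (1 - self.r) * self.mu_bar + self.r * gamma[:, None] * x
        self.mu = self.mu_bar / self.c[:, None]
        self.sigma_bar = ((1 - self.r) * self.sigma_bar
                          + self.r * gamma[:, None, None] * np.outer(x, x))
        self.sigma = (self.sigma_bar / self.c[:, None, None]
                      - np.einsum("id,ie->ide", self.mu, self.mu)
                      + eps * np.eye(x.shape[0]))   # small ridge keeps Sigma invertible

rng = np.random.default_rng(1)
model = OnlineGMM(n_components=2, dim=3, r=0.1)
for x in rng.normal(size=(200, 3)):
    a = model.score(x)   # score first, with the parameters from the previous step
    model.update(x)      # then update the parameters
```

A larger $r$ makes the model forget past data faster, so it adapts quickly to a new posture but also stops flagging a sustained change sooner.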

In the proposed system, SmartSifter is extended to use batches in the time series and therefore takes $\bm{X}^{(t)}\in\mathbb{R}^{N\times M}$ as input, where $N$ represents the number of frames in a single batch and $M$ represents the number of the dimensions of the modality data. This is because the frame-by-frame behavioral changes would include instantaneous physiological responses, which make the feedback to the coach noisy. Introducing batch processing enables the proposed system to detect the changes in the distribution of the behavior. This increases the chance of capturing relatively long-term behavioral changes, which may be more difficult to recognize for a human observer (Nook et al., 2015; Slessor et al., 2008).

Moreover, the proposed system exploits SmartSifter to obtain the interpretive feedback, fulfilling requirement (3). Given that the GMM is used for clustering of the behavioral data, each component of the obtained GMM can be regarded as a representation of a particular behavioral state of the coachee. Therefore, the proposed system can obtain the $l$ representative frames in a batch at time $t$ as follows.

[TABLE]

Conversely, the most significant outlier frame can be obtained as follows.

[TABLE]

By observing both the representative frames from the last batch and the outlier frame from the current batch, the coach can easily understand how the behavior of the coachee has changed while preserving the interpretiveness of the change.
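One plausible reading of the frame selection is sketched below: for each GMM component, the representative frame is the frame in the batch with the highest density under that component, and the outlier frame is the frame with the lowest mixture likelihood. Both selection rules are assumptions made for illustration, as are the function names; the original definitions may differ.

```python
import numpy as np

def log_gaussian(X, mu, sigma):
    """Row-wise ln N(x | mu, sigma) for a batch X of shape (N, M)."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(sigma, diff.T).T)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def representative_frames(X, means, covs):
    """Assumed rule: per component, the frame with the highest density."""
    return [int(np.argmax(log_gaussian(X, mu, s))) for mu, s in zip(means, covs)]

def outlier_frame(X, weights, means, covs):
    """Assumed rule: the frame with the lowest mixture likelihood."""
    log_terms = np.stack([np.log(c) + log_gaussian(X, mu, s)
                          for c, mu, s in zip(weights, means, covs)])   # shape (l, N)
    m = log_terms.max(axis=0)
    log_mix = m + np.log(np.exp(log_terms - m).sum(axis=0))
    return int(np.argmin(log_mix))

# Toy batch of four 2-D frames and a two-component mixture.
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.2, -0.1], [20.0, -20.0]])
means = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]
rep = representative_frames(X, means, covs)
out = outlier_frame(X, [0.5, 0.5], means, covs)
```

The returned indices point back to video frames, so the feedback can show actual images of the coachee rather than abstract statistics.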

We summarize the above in Algorithm 1. Every time new input data arrive, the outlierness is calculated in the same manner as in Equation 1. If the outlierness exceeds the given threshold, the representative frames, which show the representative states so far, and the current outlier frame, both obtained using the equations above, are presented to the coach as the feedback. Then, the parameters of the GMM are updated using the update equations above.
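The control flow summarized above can be sketched as a score-then-update loop. The sketch below is not Algorithm 1 itself: to stay self-contained it uses a toy running-mean model in place of the GMM, and the names `detect_cues` and `RunningMeanModel` are hypothetical. What it illustrates is the ordering, namely that each batch is scored with the parameters from the previous step before the model is updated.

```python
import numpy as np

def detect_cues(batches, score, update, a_th):
    """Score each batch with the current model, flag it if above a_th, then update."""
    detected = []
    for t, X in enumerate(batches):
        a = score(X)                  # outlierness under parameters from time t - 1
        if a > a_th:
            detected.append((t, a))   # here the system would show frames and vibrate
        update(X)                     # parameters are updated after scoring
    return detected

class RunningMeanModel:
    """Toy stand-in for the GMM: tracks an exponentially discounted mean."""

    def __init__(self, dim, r=0.1):
        self.mean, self.r = np.zeros(dim), r

    def score(self, X):
        return float(np.linalg.norm(X.mean(axis=0) - self.mean))

    def update(self, X):
        self.mean = (1 - self.r) * self.mean + self.r * X.mean(axis=0)

# Five unchanged batches followed by one shifted batch.
batches = [np.zeros((60, 2)) for _ in range(5)] + [np.full((60, 2), 3.0)]
model = RunningMeanModel(dim=2)
cues = detect_cues(batches, model.score, model.update, a_th=1.0)
```

Only the last batch is flagged here, because it is the only one whose mean differs from the history the model has accumulated.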

3.5. Feedback

To satisfy requirement (4), we make use of tactile feedback to notify the coaches when behavioral cues are detected. In particular, the proposed system provides coaches with a smartwatch, which vibrates on detection of a behavioral cue. Tactile feedback is used because it does not interfere with the performance of concurrent tasks while obtaining high notice rates (E. Sklar and Sarter, 2000). Moreover, the capability of tactile feedback during social interactions has been confirmed (Damian and André, 2016).

At the same time, representative frames from past scenes and the outlier frame from the current scene are displayed to coaches on detection of a behavioral cue. This allows coaches to easily understand how the behavior has changed, and this information is used to further infer and analyze the internal state of the coachee. In other words, this visualization achieves both the intuitiveness and interpretiveness as stipulated by requirement (3), because it enables the coaches to grasp feedback at a glance while making the feedback informative.

Moreover, because the scale of body movement patterns varies between individuals (Ekman et al., 1980), it is better to provide the coaches with the capability to adjust the detection threshold during the sessions. Therefore, we placed “more” and “less” buttons on the smartwatch to change the threshold $a_{th}$ in Algorithm 1. In this way, the coaches are able to control the frequency of notifications according to each coachee.
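The threshold adjustment can be pictured with a small controller. This is a sketch under assumptions: the step size, the class name, and the mapping of “more” to a lower threshold (so that more notifications are raised) are illustrative choices rather than details reported for the system.

```python
class ThresholdControl:
    """Hypothetical controller for the detection threshold a_th."""

    def __init__(self, a_th, step=0.5):
        self.a_th = a_th
        self.step = step

    def more(self):
        """'More' notifications: lower the threshold so more scores exceed it."""
        self.a_th -= self.step

    def less(self):
        """'Less' notifications: raise the threshold so fewer scores exceed it."""
        self.a_th += self.step

    def should_notify(self, outlierness):
        return outlierness > self.a_th

control = ThresholdControl(a_th=10.0, step=2.0)
before = control.should_notify(9.0)   # below the initial threshold
control.more()                        # coach asks for more notifications
after = control.should_notify(9.0)    # now above the lowered threshold
```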

In summary, by combining the tactile notification via a smartwatch with intuitive and interpretive visual information, coaches are freed from the burden of paying attention to displays while conversing with the coachees. Moreover, the coaches can easily adjust the sensitivity of the detection on the smartwatch depending on the characteristics of each coachee. Therefore, the coaches can exploit feedback without losing concentration, and better coaching performance is expected.

4. Preliminary Experiment

To determine which modality to use in the proposed system, we conducted a preliminary experiment. In this section, we describe the procedure and results of the experiment.

4.1. Data Collection

The experiment involved three professional coaches (aged 25–39 years old), who participated voluntarily. Each coach had coaching sessions with two different coachees for at least 30 minutes (4h 28m 35s in total). All participants, including both the coaches and the coachees, agreed to the use of the collected data for research purposes.

During the sessions, the behavior of the coachees was recorded with a video camera and a Tobii eye tracker. After each session, the participating coaches were asked to watch the recorded video and list the top 10 most important behavioral cues of the coachee to infer their internal states.

4.2. Implementation

We implemented the proposed algorithm and applied it to the recorded data. Based on our empirical observations, the number of components and the forgetting rate in Algorithm 1 were set to 2 and 0.1, respectively. In detail, we found that, as the number of components increased, not only did it take more time until the model’s initial convergence, but it also became more difficult for the coach to interpret the detected cues since a larger number of frames were displayed simultaneously. To compare the detected results with the behavioral cues pointed out by the coaches, we had the system output the top 10 most significant peaks in the outlierness instead of specifying the threshold. Here, the first three minutes of each session were excluded from the detection because it takes several minutes for the parameters of the GMM to converge, as shown in Figure 4.
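The peak selection described above can be sketched as follows. The definition of a peak (a strict rise followed by a non-rise) and the function name are assumptions; the warm-up of 180 seconds corresponds to the excluded first three minutes.

```python
import numpy as np

def top_peaks(times, scores, k=10, warmup=180.0):
    """Return the times of the k highest local maxima after the warm-up period."""
    times = np.asarray(times, dtype=float)
    scores = np.asarray(scores, dtype=float)
    peaks = [i for i in range(1, len(scores) - 1)
             if times[i] >= warmup
             and scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]]
    peaks.sort(key=lambda i: scores[i], reverse=True)   # most significant first
    return [float(times[i]) for i in peaks[:k]]

# Toy outlierness series sampled every 30 seconds for 10 minutes.
times = np.arange(0, 600, 30)
scores = np.zeros(len(times))
scores[2] = 9.0    # t = 60 s, inside the warm-up and therefore ignored
scores[8] = 5.0    # t = 240 s
scores[12] = 7.0   # t = 360 s
selected = top_peaks(times, scores, k=2)
```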

In addition, we had the proposed system obtain new behavioral data every 0.5 seconds and combine them into 30-second batches to ensure that the same processing performance was reproducible in a portable computing environment, e.g., a regular laptop with a single GPU, as stipulated in requirement (5). This is also expected to make the feedback less noisy, as discussed in the “Anomaly Detection” section.
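With one sample every 0.5 seconds, a 30-second batch holds $N=60$ frames. The sketch below groups frame-level vectors into such batches. It assumes non-overlapping batches and a 26-dimensional frame (12 two-dimensional key points plus a two-dimensional gaze position); both are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np

def make_batches(frames, sample_period=0.5, batch_seconds=30.0):
    """Group frame-level vectors into non-overlapping batches of shape (N, M)."""
    frames = np.asarray(frames, dtype=float)
    n = int(round(batch_seconds / sample_period))   # N = 60 with the settings above
    usable = (len(frames) // n) * n                 # drop an incomplete trailing batch
    return frames[:usable].reshape(-1, n, frames.shape[1])

# About three minutes of synthetic 26-D frames, plus a few leftover frames.
frames = np.random.default_rng(0).normal(size=(3 * 60 * 2 + 7, 26))
batches = make_batches(frames)
```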

4.3. Evaluation Metrics

To compare the effectiveness of each modality, we chose the recall and the minimizing Kendall’s $\tau$ distance (Fagin et al., 2003) as evaluation metrics. The recall represents how many of the behavioral cues pointed out by the coaches are captured by the proposed system. In this case, we set the error tolerance to 30 seconds. This is because, in addition to the constraint of the batch size, some behavioral cues, such as stretching or scratching one’s head, take more than several seconds, making it difficult to specify their precise timing.
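The recall with a 30-second tolerance can be sketched as below. The matching rule is an assumption: a coach-listed cue counts as captured if any detected time lies within the tolerance, and one detection may match more than one cue. Times are in seconds and the function name is hypothetical.

```python
def recall_with_tolerance(coach_cues, detected, tolerance=30.0):
    """Fraction of coach-listed cue times with a detection within the tolerance."""
    if not coach_cues:
        return 0.0
    hits = sum(any(abs(c - d) <= tolerance for d in detected) for c in coach_cues)
    return hits / len(coach_cues)

# Only the cue at 100 s has a detection (120 s) within 30 seconds.
example_recall = recall_with_tolerance([100, 400, 900], [120, 500])
```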

The minimizing Kendall’s $\tau$ distance is a metric to measure the similarity between two top $k$ lists $(k\geq 2)$ and is widely used in the evaluation of search engines (Haveliwala, 2003; Ziegler et al., 2005). Given the two lists $\tau_{1}$ and $\tau_{2}$ , the distance is defined as follows.

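$$K_{min}\left(\tau_{1},\tau_{2}\right)=\frac{\sum_{\{i,j\}\in P\left(\tau_{1},\tau_{2}\right)}\bar{K}_{i,j}\left(\tau_{1},\tau_{2}\right)}{k\times(k-1)/2}$$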

Here, $P\left(\tau_{1},\tau_{2}\right)$ denotes the set of all unordered pairs of distinct elements in $\tau_{1}\cup\tau_{2}$ . Then, $\bar{K}_{i,j}\left(\tau_{1},\tau_{2}\right)=1$ if (i) $i$ appears only in one list and $j$ appears only in the other list, (ii) $i\prec j$ in one list and only $j$ appears in the other list, or (iii) $i\prec j$ in one list and $i\succ j$ in the other list; otherwise, $\bar{K}_{i,j}\left(\tau_{1},\tau_{2}\right)=0$ . Consequently, if $\tau_{1}$ and $\tau_{2}$ are identical, $K_{min}\left(\tau_{1},\tau_{2}\right)=0$ .
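The three penalty cases above translate directly into code. The following sketch computes the distance for two top-$k$ lists of hashable items, following the case definitions and the normalization by $k(k-1)/2$ given in the text; the function name `k_min` and the representation of each list as a Python list ordered from highest to lowest rank are assumptions.

```python
from itertools import combinations

def k_min(tau1, tau2):
    """Minimizing Kendall's tau distance between two top-k lists."""
    k = len(tau1)
    if k != len(tau2) or k < 2:
        raise ValueError("both lists must have the same length k >= 2")
    r1 = {item: rank for rank, item in enumerate(tau1)}
    r2 = {item: rank for rank, item in enumerate(tau2)}
    total = 0
    for i, j in combinations(list(set(tau1) | set(tau2)), 2):
        both1 = i in r1 and j in r1
        both2 = i in r2 and j in r2
        if both1 and both2:
            # Case (iii): both lists rank the pair, in opposite order.
            total += int((r1[i] < r1[j]) != (r2[i] < r2[j]))
        elif both1 or both2:
            # Case (ii): one list ranks both items and the other list
            # contains only the lower-ranked one of the two.
            full, other = (r1, r2) if both1 else (r2, r1)
            lower = j if full[i] < full[j] else i
            total += int(lower in other)
        else:
            # Case (i): each item appears in exactly one list, in different lists.
            total += 1
    return total / (k * (k - 1) / 2)

same = k_min(["a", "b", "c"], ["a", "b", "c"])
reversed_order = k_min(["a", "b", "c"], ["c", "b", "a"])
```

With this normalization, the value is 0 for identical lists and 1 for two lists that hold the same items in reverse order.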

4.4. Results

The results are shown in Table 1. From the comparison of the recall and the minimizing Kendall’s $\tau$ distance, we confirmed that the combination of the posture and gaze information is the most suitable for detecting behavioral cues in coaching sessions. In addition, as discussed in the “Multimodal Input” section, the experiment demonstrated the effectiveness of the multimodal features over the unimodal features, as seen in the difference from the cases in which only the posture or only the gaze was used.

The facial expression information, however, did not contribute to improvements in the detection performance. Here, the chance rate of the recall is $0.11$ , meaning that, if we choose the cue points randomly, 1 of the 10 chosen points is considered correct on average. However, the recall of the case using only the facial expression is lower than that.

4.5. Analysis

In this subsection, we discuss the reasons behind the results in Table 1 by analyzing the detected cues according to each modality.

4.5.1. Why was the combination of the posture and gaze information effective?

In the recorded data, based on only the posture information, the proposed system detected a wide range of behavioral cues, from leaning on a chair to putting a hand on one’s hip (considered to represent a defensive state; Zhang and Yap, 2012) or placing a hand on the back of one’s neck (considered to represent an aggressive state; Grant, 1969), as shown in Figure 5. Likewise, the gaze information detected not only changes in the looking direction but also self-touch cues (Butzen et al., 2005) such as rubbing one’s eyebrow (considered to represent an anxious state; Butzen et al., 2005). This is because self-touch cues often interfere with the detection of the eyes and are thus captured by the anomaly detection algorithm. Therefore, the effectiveness of the combination of the posture and gaze information can be attributed to the capability of the system to detect a variety of behavioral cues.

4.5.2. Why was the facial expression ineffective?

As previously stated in the “Results” section, the facial expression did not improve the accuracy of detecting behavioral cues in the sessions. There appear to be two reasons for this result. First, facial expressions are obvious and superficial; therefore, the coaches do not regard them as important behavioral cues reflecting the internal states of coachees. Second, even though the neural network used in the experiment is state-of-the-art, the accuracy might not be sufficiently high. This is attributable to the fact that MicroExpNet is designed not for faces in free conversation but for posed faces. Therefore, for example, a face with an open mouth may be classified as fearful even though the coachee is just talking.

4.5.3. What was the difference between the coaches and the proposed system?

From the results, at least 30% of the detected points were not included in the behavioral cues listed by the coaches. When we showed such points to the participating coaches, the points were roughly divided into two groups based on their responses. The first was a group of obvious and non-informative points such as opening one’s notebook or sneezing. This is because the proposed method uses an unsupervised algorithm and cannot take the meanings of the detected points into consideration. This result suggests the importance of designing non-interruptive notification, as in requirement (4), so that the coach can easily ignore feedback when it is non-informative.

The other group included points that the coaches agreed were informative. One participant said: “Although I did not notice this when I watched the recorded video, once the system pointed out that the coachee had bent slightly forward, I could see that he was opening his mind from about that moment.” This comment suggests that the proposed system could contribute to improving the quality of coaching sessions by providing the feedback in real-time.

5. User Study

To confirm the effectiveness of the proposed system, we conducted a user study. In this section, we describe its setting and results.

5.1. Implementation

We implemented a complete version of the proposed system to perform a user study. The same parameters as in the “Data Collection” section were used except for the detection threshold. The system uses the outlierness of the first peak occurring at least three minutes after the beginning of each session as the initial threshold and allows coaches to adjust the threshold subsequently via their smartwatches, as mentioned in the “Feedback” section.

The feedback indicating the behavioral cues is presented in Figure 6. It displays the representative frames from past scenes and outlier frames from current scenes so that the coach can understand how the behavior of the coachee has changed. At the same time, the coach’s smartwatch vibrates and shows buttons to adjust the threshold as shown in Figure 7. Pressing the “more” button decreases the threshold while pressing the “less” button increases the threshold.
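The threshold behavior described above can be sketched as follows. This is an illustrative approximation, not the implementation used in the study: the local-maximum definition of a "peak", the 10% multiplicative step, and the names `initial_threshold` and `ThresholdControl` are all assumptions, since the text does not specify them.

```python
from typing import Optional, Sequence


def initial_threshold(
    times: Sequence[float],
    scores: Sequence[float],
    warmup: float = 180.0,
) -> Optional[float]:
    """Outlierness at the first local maximum occurring at or after
    `warmup` seconds (three minutes), or None if there is none.

    Assumption: a peak is a sample strictly above its predecessor and
    not below its successor.
    """
    for k in range(1, len(scores) - 1):
        if times[k] >= warmup and scores[k - 1] < scores[k] >= scores[k + 1]:
            return scores[k]
    return None


class ThresholdControl:
    """Mimics the smartwatch "more" / "less" buttons."""

    def __init__(self, threshold: float, step: float = 0.1) -> None:
        self.threshold = threshold
        self.step = step  # hypothetical relative step size

    def more(self) -> None:
        # "more" feedback: lower the threshold so more points are reported.
        self.threshold *= 1.0 - self.step

    def less(self) -> None:
        # "less" feedback: raise the threshold so fewer points are reported.
        self.threshold *= 1.0 + self.step

    def should_notify(self, score: float) -> bool:
        return score >= self.threshold


# Hypothetical outlierness values sampled once per minute.
times = [0.0, 60.0, 120.0, 180.0, 240.0, 300.0, 360.0]
scores = [0.2, 0.9, 0.3, 0.4, 0.8, 0.5, 0.6]

control = ThresholdControl(initial_threshold(times, scores))
print(control.should_notify(0.75))  # False: below the initial threshold
control.more()
print(control.should_notify(0.75))  # True: the threshold was lowered
```

In this example the peak at 60 seconds is ignored because it falls within the first three minutes, so the initial threshold is taken from the peak at 240 seconds.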

5.2. Procedure

In the user study, five professional coaches (aged 25–39 years old), including the three coaches from the preliminary experiment, participated voluntarily. Each coach had coaching sessions with three different coachees (15 sessions in total) using the proposed system, as shown in Figure 8.

Then, we conducted short interviews with the participating coaches. In this interview, we asked for their subjective opinions concerning the usability of the proposed system. We also asked whether the given feedback effectively improved the quality of the sessions. If the participant agreed on the effectiveness, we asked how the sessions changed due to the feedback.

5.3. Comments

We found all of the participating coaches responded positively about the proposed system in the subjective interview. Here, we separately examine the obtained comments concerning the usability and the effectiveness.

5.3.1. The usability of the proposed system

When we asked about the usability of the proposed system, the replies were affirmative, such as “There was nothing confusing or difficult to understand.” and “It was so easy to use that I can imagine that I am using it from tomorrow.”

More specifically, concerning the visualization of the feedback, one participant responded,

“Putting the past frames side by side makes the changes in the behavior obvious.”

Another participant commented on the comparison with the explicit feedback:

“Simpler feedback such as just showing ‘defensive’ or ‘opening one’s heart’ could also be easy to understand. However, if it contradicts my assumptions, I could get confused and might ignore the feedback. In that respect, this system passes the initiative to me and does not cause such confusion while reminding me of other possibilities.”

Concerning the tactile notification, one participant responded,

“I think the notification is very good because it does not break my concentration and it is not noticed by the subject.”

From a different perspective, another participant commented on the benefit of the tactile notification:

“When having sessions with about seven or eight clients a day, I sometimes feel out of it. At such a time, the tactile feedback would help me focus on the sessions.”

The above comments support the usability of the proposed system, as well as the suitability of the feedback design.

At the same time, one participant gave us suggestions for future improvements. He suggested visualizing the trends of the behavioral cues throughout the session, or throughout multiple sessions of the same coachee:

“Further inferences are possible if this shows that the coachee repeats similar behaviors or that the trend in the behavioral cues changes depending on the topic of conversation.”

This can be accomplished by applying a clustering algorithm in an unsupervised manner. For example, the algorithm enables the similarity with past scenes to be represented by visualizing the cue that each cluster belongs to in a time series.

Moreover, this could lead the proposed system to determine the non-informative behavioral cues. In particular, by adding an “obvious” button to the smartwatch in the same manner as shown in Figure 7, clusters of the non-informative cues could be identified in an interactive manner. In this way, the presence of obvious and non-informative cues, which were discussed in the “What was the difference between the coaches and the proposed system?” section, can be reduced. Therefore, we would like to implement this feature in the near future.

5.3.2. The effectiveness of the proposed system to improve the quality of the sessions

The participating coaches also commented positively on the informativeness of the detected behavioral cues and the effectiveness of the proposed system in the sessions, for example:

“Although I often immerse myself in the conversation, thanks to this system, I was able to pay attention to the behavior of the client.”

and

“This system made me realize that I unconsciously missed or ignored many important behavioral cues.”

Other comments confirmed that the proposed system helped the coaches change the content of the sessions in accordance with the state of the coachees:

“I had been convinced that the coachee was agreeing to my proposal, but from the given feedback, I noticed that it didn’t seem true. So, I was able to make a decision to explain my proposal more carefully until he was satisfied.”

“I was impressed when the smartwatch vibrated immediately after I asked a delving question having butterflies in my stomach. From the feedback, I became convinced that the underlying cause of the current issue lies there, and succeeded in having a deep discussion in a short period of time.”

In addition, one participant commented on the educational aspect of the proposed system:

“Up to this time, to cultivate observational skills, we had to review the recordings of the sessions of ourselves or observe the sessions by other coaches. However, using this system, it would be possible to learn what sort of behavioral cues should be focused on during a session and reduce the training time.”

The above comments suggest that the proposed system may effectively improve the quality of the coaching sessions.

6. Discussion

Although the proposed system was generally appreciated in the “User Study” section, there is still room for further exploration. In this section, we discuss the limitations and future directions of our research.

6.1. Limitations

In the preliminary study, the combination of the posture and gaze information was confirmed to be the most effective and was thus chosen as the input modalities in the subsequent user study. Nonetheless, the number of participants was too small to rule out other possibilities. Additional investigations with other available modalities are desirable to identify potential combinations.

Also, though the effectiveness of the proposed system is qualitatively supported by the comments from the participating coaches in the user study, a quantitative comparison of the outcome of the coaching sessions with control groups would be preferable so as to avoid subject biases. However, the impact of executive coaching is shaped by a variety of factors such as its purpose, length, organizational context, and individual differences (Joo, 2005), and evaluating its outcome via randomized controlled experiments is costly (Athanasopoulou and Dopson, 2018). One possible remedy is to expand the preliminary experiment to support the results from other aspects, e.g., collecting self-labelled ground-truth data of the internal states from coachees to validate whether the change of their internal states is captured by the proposed system.

6.2. Future Directions

Throughout the user study, the effectiveness of the real-time feedback is confirmed. In particular, changing the direction of the session on the spot based on the detected cues is not achievable using the conventional methods designed for post-analysis, as we mentioned in the “Related Work” section.

On the other hand, though the design of the proposed feedback system is based on the rationale presented in the “Requirements” section, there are other possibilities like those proposed by previous studies. We would like to explore a better interaction with the coaches such as comparing different ways of presenting cues, or suppressing non-informative notifications using a clustering method, which is discussed in the “The usability of the proposed system” section.

Exploring further use cases also remains a promising endeavor. Since the proposed method relies on unsupervised learning and does not require any prior knowledge or rules, it could be used to analyze the behavior of people outside coaching sessions. For example, REsCUE might be able to assist people working in dementia care, where it is necessary to analyze the behavior of a patient and consider therapeutic approaches (Brooker, 2003). Moreover, the connection between conversation and structural neural connectivity in children has been elucidated recently (Romeo et al., 2018), and thus REsCUE could potentially be used in early childhood education as well.

7. Conclusions

In this study, we introduced REsCUE, an intelligent system for use in coaching sessions that can automatically detect nonverbal behavioral cues of coachees and provide feedback to coaches in real-time. Based on a preliminary experiment, the posture and gaze information proved to be effective modalities to detect behavioral cues. In actual sessions with professional coaches, a number of favorable comments were obtained, indicating that REsCUE can help coaches to maintain a conversation with coachees while simultaneously inferring their internal states. For future work, we will investigate other applications of REsCUE, exploiting the fact that the proposed method is based on an unsupervised algorithm and does not depend on prior information.

Bibliography

Jagdishkumar Aggarwal and Michael S. Ryoo. 2011. Human activity analysis: A review. Comput. Surveys 43, 3 (2011), 16:1–16:43. https://doi.org/10.1145/1922649.1922653
Aron W. Siegman and Stanley Feldstein (Eds.). 1978. Nonverbal Behavior and Communication. Lawrence Erlbaum Associates, Hillsdale, NJ.
Andromachi Athanasopoulou and Sue Dopson. 2018. A systematic review of executive coaching outcomes: Is it the journey or the destination that matters the most? The Leadership Quarterly 29, 1 (2018), 70–88. https://doi.org/10.1016/j.leaqua.2017.11.004
Tatiana Bachkirova. 2014. Psychological development in adulthood and coaching. In The Complete Handbook of Coaching (2 ed.), Elaine Cox, Tatiana Bachkirova, and David Clutterbuck (Eds.). SAGE Publications, London.
Cigdem Beyan, Francesca Capozzi, Cristina Becchio, and Vittorio Murino. 2018. Prediction of the Leadership Style of an Emergent Leader Using Audio and Visual Nonverbal Features. IEEE Transactions on Multimedia 20, 2 (2018), 441–456. https://doi.org/10.1109/TMM.2017.2740062
Gary S. Bloom, Claire L. Castagna, Ellen Moir, and Betsy Warren (Eds.). 2005. Blended Coaching: Skills and Strategies to Support Principal Development. Corwin Press, Thousand Oaks, CA.
Dawn Brooker. 2003. What is person-centred care in dementia? Reviews in Clinical Gerontology 13, 3 (2003), 215–222. https://doi.org/10.1017/S095925980400108X