AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition

Fabien Ringeval; Björn Schuller; Michel Valstar; Nicholas Cummins; Roddy Cowie; Leili Tavabi; Maximilian Schmitt; Sina Alisamir; Shahin Amiriparian; Eva-Maria Messner; Siyang Song; Shuo Liu; Ziping Zhao; Adria Mallol-Ragolta; Zhao Ren; Mohammad Soleymani; Maja Pantic

arXiv:1907.11510·cs.HC·July 29, 2019

TL;DR

The AVEC 2019 Challenge provided a standardized benchmark for multimodal health and emotion recognition, focusing on depression detection, state-of-mind, and cross-cultural affect analysis using audiovisual data.

Contribution

This paper introduces new tasks, guidelines, and baseline results for AVEC 2019, advancing multimodal emotion and health recognition research.

Findings

1. Baseline systems achieved measurable performance on all three tasks.

2. The challenge fostered comparison of different multimedia processing approaches.

3. New datasets and evaluation protocols were established for future research.

Abstract

The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks:…

Tables (7)

Table 1. Number of subjects and duration of the storytellings contained in the USoM database (Rathner et al., 2018b).

Partition	# Subjects	Duration [h:min:s]
Training	45	13:49:38
Development	33	10:46:57
Test	33	9:46:14
All	111	34:22:49

Table 2. Number of subjects and duration of the interviews included in the Extended-DAIC database (Gratch et al., 2014).

Partition	# Subjects	Duration [h:min:s]
Training	163	43:30:20
Development	56	14:47:31
Test	56	14:52:42
All	275	73:10:33

Table 3. Number of subjects and duration of the video chats contained in the SEWA database (Kossaifi et al., 2019).

Culture	Partition	# Subjects	Duration [h:min:s]
German	Training	34	1:33:12
German	Devel.	14	0:37:46
German	Test	16	0:46:38
Hungarian	Training	34	1:08:24
Hungarian	Devel.	14	0:28:42
Hungarian	Test	18	0:36:06
Chinese	Test	70	3:17:52
All		200	8:28:40

Table 4. Baseline results evaluated with $CCC$ for the AVEC 2019 SoMS; USoM data set (Rathner et al., 2018b); BoAW-M/e: bags-of-audio-words with MFCCs/eGeMAPS; DS-DNet: Deep Spectrum using DenseNet-121; DS-VGG: Deep Spectrum using VGG-16; best result on the test partition is highlighted in bold.

	Audio						Video				Fusion
Partition	MFCCs	eGeMAPS	BoAW-M	BoAW-e	DS-DNet	DS-VGG	FAUs	BoVW	ResNet	VGG	All
	Random sampling of training instances
Development	.282	.412	.336	.295	.280	.384	.372	.317	.261	.318	.417
Test	–	.276	–	–	–	.289	.119	–	–	.191	.278
	Curriculum sampling of training instances
Development	.299	.378	.334	.288	.326	.437	.419	.313	.300	.318	.464
Test	–	.294	–	–	–	.208	.151	–	–	.160	.236

Table 5. Comparison of the approaches – training or testing on a static or dynamic measure of mood – used for the AVEC 2019 SoMS; averaged $CCC$ results are reported as $\mu(\sigma)$.

Partition	Static training	Dynamic training
	Static evaluation
Development	.149 (.108)	.335 (.050)
Test	.037 (.063)	.219 (.068)
	Dynamic evaluation
Development	.368 (.150)	.102 (.066)
Test	.325 (.052)	.040 (.094)

Table 6. Baseline results evaluated with $CCC$ for the AVEC 2019 DDS; $RMSE$ is additionally reported; BoAW-M/e: bags-of-audio-words with MFCCs/eGeMAPS; DS-DNet: Deep Spectrum using DenseNet-121; DS-VGG: Deep Spectrum using VGG-16; best result on the test partition is highlighted in bold.

	Audio						Video				Fusion
Partition	MFCCs	eGeMAPS	BoAW-M	BoAW-e	DS-DNet	DS-VGG	FAUs	BoVW	ResNet	VGG	All
	Regression of PHQ-8 score ($CCC$)
Development	.198	.076	.102	.272	.165	.305	.115	.107	.269	.108	.336
Test	–	–	–	.045	–	.108	.019	–	.120	–	.111
	Regression of PHQ-8 score ($RMSE$)
Development	7.28	7.78	6.32	6.43	8.09	8.00	7.02	5.99	7.72	7.69	5.03
Test	–	–	–	8.19	–	9.33	10.0	–	8.01	–	6.37

Table 7. Baseline results evaluated with $CCC$ for the AVEC 2019 CES; SEWA dataset (Kossaifi et al., 2019); DS: Deep Spectrum; best result on the test partition is highlighted in bold.

		Audio					Video				Fusion
Culture	Partition	MFCCs	eGeMAPS	BoAW-M	BoAW-e	DS	FAUs	BoVW	ResNet	VGG	All
		Arousal
German	Dev.	.389	.396	.323	.434	.380	.606	.556	.475	.561	.629
German	Test	–	.293	–	.276	–	.562	–	–	.505	.517
Hungarian	Dev.	.236	.305	.237	.291	.156	.425	.321	.460	.367	.583
Hungarian	Test	–	.272	–	.250	–	.527	–	–	.396	.525
Ger. + Hun.	Dev.	.326	.371	.298	.398	.312	.531	.467	.473	.493	.614
Chinese	Test	–	.100	–	.107	–	.355	–	–	.297	.238
		Valence
German	Dev.	.344	.405	.190	.455	.317	.639	.594	.552	.595	.684
German	Test	–	.309	–	.325	–	.627	–	–	.646	.622
Hungarian	Dev.	.017	.073	.042	.135	.084	.463	.421	.373	.363	.508
Hungarian	Test	–	.166	–	.151	.173	.459	–	–	.548	.397
Ger. + Hun.	Dev.	.187	.286	.134	.352	.233	.565	.523	.487	.505	.615
Chinese	Test	–	.267	–	.281	–	.468	–	–	.398	.423
		Liking
German	Dev.	.159	.136	.140	.003	.164	.056	.073	.057	.244	.048
German	Test	–	.012	–	.074	–	-.042	–	–	-.052	-.019
Hungarian	Dev.	.115	.192	-.027	.253	.121	.104	.041	.028	.028	.260
Hungarian	Test	–	.051	–	.089	–	-.062	–	–	-.069	-.22
Ger. + Hun.	Dev.	.144	.159	.074	.138	.142	.083	.057	.040	.037	.222
Chinese	Test	–	.007	–	.041	–	.006	–	–	-.006	-.012

Full text

AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition

Fabien Ringeval (ORCID 0000-0002-9213-4529), Université Grenoble Alpes, CNRS, Grenoble, France
Björn Schuller, University of Augsburg, Augsburg, Germany
Michel Valstar, University of Nottingham, Nottingham, UK
Nicholas Cummins, University of Augsburg, Augsburg, Germany
Roddy Cowie, Queen’s University Belfast, Belfast, UK
Leili Tavabi, University of Southern California, Los Angeles, USA
Maximilian Schmitt, University of Augsburg, Augsburg, Germany
Sina Alisamir, Université Grenoble Alpes, CNRS, Grenoble, France
Shahin Amiriparian, University of Augsburg, Augsburg, Germany
Eva-Maria Messner, University of Ulm, Ulm, Germany
Siyang Song, University of Nottingham, Nottingham, UK
Shuo Liu, University of Augsburg, Augsburg, Germany
Ziping Zhao, Tianjin Normal University, Tianjin, China
Adria Mallol-Ragolta, University of Augsburg, Augsburg, Germany
Zhao Ren, University of Augsburg, Augsburg, Germany
Mohammad Soleymani, University of Southern California, Los Angeles, USA
Maja Pantic, Imperial College London, London, UK

(2019)

Abstract.

The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) “State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition” is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.

Affective Computing; State-of-Mind; Cross-Cultural Emotion

Conference: ACM Multimedia conference; 2019 Audio/Visual Emotion Challenge and Workshop (AVEC’19), October 2019, Nice, France. DOI: 10.1145/3266302.3266316. ISBN: 978-1-4503-5983-2/18/10. CCS: General and reference; Performance.

1. Introduction

The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) is the ninth competition aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual, and audiovisual health and emotion sensing, with all participants competing strictly under the same conditions (Schuller et al., 2011, 2012; Valstar et al., 2013, 2014; Ringeval et al., 2015; Valstar et al., 2016a; Ringeval et al., 2017a, 2018b).

One of the goals of the AVEC series is to bring together multiple communities from different disciplines, in particular, the audiovisual multimedia communities and those in the psychological and social sciences who study expressive behaviour. Another objective is to advance health and emotion recognition systems by providing a common benchmark test set for multimodal information processing, in order to compare the relative merits of the approaches to automatic health and emotion analysis under well-defined conditions, i.e., with large volumes of un-segmented, non-prototypical and non-preselected data of wholly naturalistic behaviour. This is precisely the type of data that the new generation of affect-oriented multimedia and human-machine/human-robot communication interfaces has to face in the real world.

Major novelties are introduced for AVEC 2019, with three separate Sub-challenges focusing on health and emotion analysis: (i) State-of-Mind Sub-challenge (SoMS), (ii) Detecting Depression with AI Sub-challenge (DDS), and (iii) Cross-cultural Emotion Sub-challenge (CES). In the following, we describe the novelties introduced in the Challenge and the guidelines for participating.

The State-of-Mind Sub-challenge (SoMS) is a new task focusing on the continuous adaptation of human state-of-mind (SOM), which is pivotal for mental functioning and behaviour regulation (Houben et al., 2015). SOM is constantly shifting due to internal and external stimuli, and frequent use of either adaptive or maladaptive SOM influences our mental health. One key aspect of the human experience is the way emotion features in our SOM (Shapiro and MacInnis, 2002; Schwarz and Clore, 1983). In the SoMS, the mood self-reported (on a 10-point Likert scale) after the narration of personal stories (two positive and two negative) has to be predicted automatically from the audiovisual recordings of those stories, which are taken from the USoM corpus (Rathner et al., 2018b).

The Detecting Depression with AI Sub-challenge (DDS) is a major extension of the AVEC 2016 DSC (Valstar et al., 2016b), where the level of depression severity (PHQ-8 questionnaire) was assessed from audiovisual recordings of patients interacting with a virtual agent conducting a clinical interview and driven by a human as a Wizard-of-Oz (WoZ), using the DAIC-WOZ corpus (Gratch et al., 2014). The DAIC data set now contains new recordings of the same population with the virtual agent being, this time, wholly driven by AI, i.e., without any human intervention. Those new recordings are used as a test partition for the DDS, and will help to understand how the absence of a human controlling the virtual agent affects automatic depression severity assessment.

The Cross-cultural Emotion Sub-challenge (CES) is a large extension of the AVEC 2018 CES (Ringeval et al., 2018a), where dimensions of emotion were inferred from audiovisual recordings collected “in-the-wild”, i.e., with standard webcams and at home or in the workplace. A cross-cultural setup was further exploited for inferring emotion: knowledge of German culture was leveraged to infer emotion on the Hungarian culture, using the SEWA corpus (Kossaifi et al., 2019). This dataset now includes data collected from new participants with Chinese culture, which are used as a test set for the CES, whose aim is, therefore, to investigate how emotion knowledge of Western European cultures (German, Hungarian) can be transferred to the Chinese culture.

All Sub-challenges allow contributors to find their own features to use with their own machine learning algorithm. In addition, standard feature sets are provided for audio and video data (cf. Section 4), along with scripts available in a public repository (https://github.com/AudioVisualEmotionChallenge/AVEC2019), which participants are free to use for reproducing both the baseline features and recognition systems (cf. Section 5). The labels of the test partition remain unknown to the participants, and participants have to stick to the definition of the training, development, and test partitions. They may freely report on results obtained on the development partition, but are limited to five trials per Sub-challenge in submitting their results on the test partition.

Ranking of the contributions relies on the Concordance Correlation Coefficient ($CCC$) (Li, 1989) for all Sub-challenges; the Root Mean Squared Error ($RMSE$) is additionally reported. Whereas many other metrics of performance could be exploited for ranking the contributions, such as Spearman’s $CC$ or the coefficient of determination ($r^{2}$), we believe that the index of reproducibility $CCC$ is the most suitable metric to use, as it is not biased by changes in scale and location, and elegantly includes information on both precision and accuracy in a single statistical measure (Li, 1989). Moreover, its theoretical definition and properties are well rooted in the literature (Pandit and Schuller, 2019), and it can be easily exploited as a loss function for training neural networks (Weninger et al., 2016).
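As an illustration of the two reported metrics, the following minimal sketch computes $CCC$ and $RMSE$ with NumPy. It assumes the standard definition $CCC = 2\,\mathrm{cov}(x,y) / (\sigma_x^{2} + \sigma_y^{2} + (\mu_x - \mu_y)^{2})$ with population (biased) variances; the function names `ccc` and `rmse` are chosen here for illustration and are not taken from the challenge scripts.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient of two 1-D sequences.

    Assumes at least one of the sequences is non-constant, otherwise the
    denominator can be zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()          # population variances
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error of two 1-D sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, a prediction that is perfectly correlated with the labels but shifted by a constant, such as `ccc([1, 2, 3], [2, 3, 4])`, yields 4/7 rather than 1, because the mean difference enters the denominator.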

To be eligible to participate in the Challenge, every entry has to be accompanied by a paper submitted to the AVEC 2019 Data Challenge and Workshop, describing the results and the methods that created them. These papers undergo peer-review by the technical program committee. Only contributions with a relevant accepted paper and at least a submission of test results are eligible for participation. The organisers do not participate in the Challenge themselves, but re-evaluate the findings of the best performing system of each Sub-challenge.

The remainder of this paper is organised as follows. We summarise relevant related work in Section 2, introduce the Challenge corpora in Section 3, the common audiovisual baseline feature sets in Section 4, and the developed baseline recognition systems with the obtained results in Section 5, before concluding in Section 6.

2. Related Work

This section is a summary of the current state-of-the-art in the automatic analysis of affect with a focus on: (i) human state-of-mind, (ii) depression assessment in the context of AI-driven virtual agents, and (iii) dimensional analysis in cross-cultural paradigms.

2.1. State-of-Mind

The concept of a human SOM describes the phenomenon that our consciousness and emotions are constantly fluctuating over time; this is due to internal and external biological, psychological, and social demands (Houben et al., 2015; Rathner et al., 2018b). One key aspect of SOM is our emotions. They provide valuable information that influences our basic human processes in a bidirectional manner (Shan et al., 2009; Shapiro and MacInnis, 2002). Such processes include attention, perception, cognition, memory retrieval, memory storage, and behaviour regulation. In fact, depending on our actual SOM, some emotions, cognitions, and behaviours are more likely to occur, while others may be suppressed. This effect is the underlying principle of mood congruence (Schwarz and Clore, 1983; Russell, 2003).

Despite the major impact of SOM on health and social functioning, the quantification of current emotional states within therapy contexts has its pitfalls. The simplest of these is that it relies heavily on self-reports of emotional states, which are inherently biased (Yannakakis et al., 2017). As humans are structurally determined closed systems, it is not always sound to assume that people who give the same scores on measurement scales are actually in the same SOM (Maturana and Varela, 1987). Moreover, even within a person, the current rating of the emotional state is rooted in previous experiences, known as the adaptation level, and is therefore not accurate in an absolute sense (Russell and Lanius, 1984).

Approaches like Russell’s avoid having to limit the quantification to a given language (Russell, 2003): in his theory of core affect, every instance of emotion can be quantified on the orthogonal axes arousal (from sleepy to hyper-aroused) and valence (with the poles negative and positive). However, raters’ ability to quantify reliably is still doubtful. One approach to overcoming this is to treat emotional state values as ordinal variables (Yannakakis et al., 2017). Another is to complement self-ratings with expert ratings or physiological recordings. Each of these methods has its limitations; the mismatch between different emotion assessments is still very much a matter of scientific discourse (Schwerdtfeger and Rathner, 2016; Schwerdtfeger, 2004).

Despite the given limitations of the scientific assessment of emotional states, humans constantly monitor their own and others’ emotions and organise themselves within social systems (Sapolsky, 2004; Dautenhahn, 2002). Given the need for humans to socially interact and the increased occurrence of human-machine-interactions, the development of a real-time SOM data-driven recognition system has the potential to enhance user experience, user satisfaction, and subsequently to foster user adherence (Rathner et al., 2018b, a; Baumel and Yom-Tov, 2018). Such a system could assist society in various ways: (i) decreasing bias in the monitoring of SOM; (ii) collecting more objective data to aid the diagnosis of affective disorders; (iii) delivering tailored interventions to facilitate treatment of disease; (iv) reducing the time spent in the evaluation of treatment outcome, and in e-treatment by presenting SOM related content, easing burdens on both patient and provider (Rathner et al., 2018a, b; Schuller et al., 2018; Stappen et al., 2019).

2.2. Depression Detection with AI

Depression, particularly major depressive disorder (MDD), is a common mental health problem, with negative impacts on the way one thinks, feels, and acts (Association, 2013). It can lead to a variety of emotional and physical problems and affect many aspects of both working and personal life. The World Health Organisation (WHO) declared depression as the leading cause of ill health and disability worldwide in 2015: more than 300 million people live with it (Organization, 2017). Given the high prevalence of depression and its suicide risk, finding new methods for diagnosis and treatment becomes more and more critical.

There is growing interest in using automatic human behaviour analysis for computer-aided depression diagnosis based on behavioural cues such as facial expressions and speech prosody, because of convincing evidence that depression and related mental health disorders are associated with changes in patterns of behaviour (Cohn et al., 2009; Scherer et al., 2014; Joshi et al., 2013; Cummins et al., 2015; Williamson et al., 2013). Facial activity, gesturing, head movements and expressivity are among the behavioural signals that are strongly correlated with depression.

Early paralinguistic investigations into depressed speech found that patients consistently demonstrated prosodic speech abnormalities such as reduced pitch, reduced pitch range, slower speaking rate, and more articulation errors (Cummins et al., 2015). Facial expressions and head gestures that can be tracked by computer vision are also good predictors of depression; e.g., a more downward angle of the gaze, less intense smiles, and shorter average duration of smiles have been reported as the most salient facial cues of depression (Scherer et al., 2013). Further, body expressions, gestures, head movements, and linguistic cues have also been reported to provide relevant cues for depression detection (Morales et al., 2017; Ramirez-Esparza et al., 2008; Pennebaker et al., 2003; Althoff et al., 2016).

Taking all this evidence together, it has been proposed to integrate affective computing technology into a computer agent that interviews people and identifies verbal and nonverbal indicators of mental illnesses (DeVault et al., 2014). Data collected with subjects suffering from post-traumatic stress disorder showed that the automatic evaluation of their level of depression severity (PHQ-8 questionnaire) can achieve an $RMSE$ of less than 5 when the agent is driven by a human acting as a WoZ (Gong and Poellabauer, 2017); PHQ-8 scores lie in $[0,24]$, and cutpoints are defined at $[5,10,15,20]$ for mild, moderate, moderately severe, and severe depression, respectively. Those results need to be investigated further, with the agent being wholly driven by AI, as the wizard might drive the virtual agent to a situation that eases the observation of patterns associated with depression, or the autonomous agent might have issues in conducting the interview appropriately.
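The PHQ-8 cutpoints quoted above can be turned into a small lookup, sketched below. It assumes that each cutpoint is an inclusive lower bound (so a score of 5 is already "mild") and uses the label "none/minimal" for scores below 5; that label and the function name `phq8_severity` are illustrative choices, not taken from the paper.

```python
import bisect

# Cutpoints and severity labels as quoted in the text; the label for
# scores below the first cutpoint is an assumption made for illustration.
PHQ8_CUTPOINTS = [5, 10, 15, 20]
PHQ8_LABELS = ["none/minimal", "mild", "moderate", "moderately severe", "severe"]

def phq8_severity(score):
    """Map a PHQ-8 total score in [0, 24] to a severity label."""
    if not 0 <= score <= 24:
        raise ValueError("PHQ-8 scores must lie in [0, 24]")
    # bisect_right counts how many cutpoints are <= score.
    return PHQ8_LABELS[bisect.bisect_right(PHQ8_CUTPOINTS, score)]
```

Seen against this scale, an $RMSE$ of 5 corresponds to roughly the width of one severity band.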

2.3. Cross-cultural Emotion Recognition

Cross-cultural emotion recognition has long been highlighted as an open research question within the affective computing community (D’Mello and Kory, 2015; Elfenbein and Ambady, 2002; Esposito et al., 2015; Pantic et al., 2005), and was introduced as an AVEC Sub-challenge in 2018 (Ringeval et al., 2018a). Whereas the AVEC 2018 CES focused on detecting arousal, valence, and liking from Hungarian speakers using only German speakers for training and development of the models (Ringeval et al., 2018a), in this year’s AVEC CES the test cohort is Chinese speakers with speakers from the two cultures mentioned earlier being available for training, development, and additional testing.

A common belief in facial expression recognition is that emotional expressions have a large degree of universality across cultures (Corneanu et al., 2016; Ekman, 1971). This statement was on the whole supported by both baseline results and works submitted to the AVEC 2018 CES, with either vision-only or multimodal systems achieving higher cross-culture accuracies than speech-only approaches (Ringeval et al., 2017a; Huang et al., 2018; Wataraka Gamage et al., 2018; Zhao et al., 2018). These results were insightful, as previously, there were only a few works in the affective computing literature which supported this claim (Cordaro et al., 2018; D’Mello and Kory, 2015).

Interestingly, entries in the AVEC 2018 CES did not employ techniques such as transfer learning (Zhang et al., 2017a, b) or domain adaptation (Kaya and Karpov, 2018; Sagha et al., 2016) that are typically seen in cross-cultural testing. In (Wataraka Gamage et al., 2018), the authors proposed a model based on emotional salient detection to identify emotion markers invariant to socio-cultural context. The other two entrants employed data-driven approaches based on long short-term memory recurrent neural networks (LSTM-RNN) (Huang et al., 2018; Zhao et al., 2018). In line with similar results in the literature (Feraru et al., 2015; Scherer et al., 2001), all entrants in the AVEC 2018 CES observed a drop in system performance when testing on the Hungarian data (Huang et al., 2018; Wataraka Gamage et al., 2018; Zhao et al., 2018).

3. Challenge corpora

The AVEC 2019 Challenge relies on three corpora: (i) the USoM corpus (Rathner et al., 2018b) for the SoMS, (ii) the Extended-DAIC corpus (Gratch et al., 2014) for the DDS, and (iii) the SEWA dataset (Kossaifi et al., 2019) for the CES. We provide below a short overview of each dataset and refer the reader to the original work for a more complete description.

3.1. Ulm State-of-Mind Corpus

The Ulm State-of-Mind (USoM) database was collected to assess the association between personal storytelling and current SOM, operationalised by affective state according to Russell’s theory (Schuller et al., 2018; Rathner et al., 2018b; Russell, 2003). Parts of this dataset have been released for the Interspeech 2018 Computational Paralinguistics (ComParE) challenge (Schuller et al., 2018).

Participants of the USoM corpus were instructed to first tell two negative personal narratives $NN_{1,2}$ and subsequently two positive personal narratives $PN_{1,2}$, each for five minutes in front of a camera. They were also asked to rate their current affect ($CA$) on a 10-point Likert scale for the dimensions arousal and valence before and after telling each narrative, resulting in the following protocol: $(t_{0})$ $CA_{0}, NN_{1}$; $(t_{1})$ $CA_{1}, NN_{2}$; $(t_{2})$ $CA_{2}, PN_{1}$; $(t_{3})$ $CA_{3}, PN_{2}$; and $(t_{4})$ $CA_{4}$. For the purpose of the Challenge, the USoM dataset was partitioned into training, development, and test sets while preserving the overall speaker diversity – in terms of age, gender distribution, and core affect evaluations – within the partitions. Table 1 shows the number of subjects and duration for each partition.

As the interest of the SoMS is on the change in mood, rather than just its static observation, the initial self-reports made before the storytelling are included in the data package given to participants for all partitions, including the test set. Exploiting such contextual information in an automatic system predicting the level of mood is a realistic scenario in the real world, because a therapist would always ask about a person’s baseline emotion at the start of a session. It is thus essential to provide machine learning algorithms with the same prior information as a therapist would have.

3.2. Distress Analysis Interview Corpus

The Extended Distress Analysis Interview Corpus (E-DAIC) (DeVault et al., 2014) is an extended version of DAIC-WOZ (Gratch et al., 2014) that contains semi-clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. These interviews were collected as part of a large effort to create a computer agent that interviews people and identifies verbal and nonverbal indicators of mental illnesses (Gratch et al., 2014).

Data collected include audio and video recordings, automatically transcribed text using Google Cloud’s speech recognition service, and extensive questionnaire responses. The interviews are conducted by an animated virtual interviewer called Ellie. In the WoZ interviews, the virtual agent is controlled by a human interviewer (wizard) in another room, whereas in the AI interviews, the agent acts in a fully autonomous way using different automated perception and behaviour generation modules.

For the purpose of the Challenge, the E-DAIC dataset was partitioned into training, development, and test sets while preserving the overall speaker diversity – in terms of age, gender distribution, and the eight-item Patient Health Questionnaire (PHQ-8) scores – within the partitions. Whereas the training and development sets include a mix of WoZ and AI scenarios, the test set is solely constituted from the data collected by the autonomous AI. Details regarding the speaker distribution over the partitions are given in Table 2.

3.3. Cross-cultural Emotion Database (SEWA)

The SEWA database consists of audiovisual recordings of spontaneous behaviour of participants captured using an in-the-wild recording paradigm (Kossaifi et al., 2019). Pairs of friends or relatives from German, Hungarian, and Chinese cultures were recorded through a dedicated video chat platform which utilised participants’ own – standard – web-cameras and microphones. After watching a set of commercials, pairs of participants were given the task of discussing the last advert watched (a video clip advertising a water tap) for up to three minutes. The aim of this discussion was to elicit further reactions and opinions about the advert and the product advertised.

The video chats of the three cultures have been annotated w.r.t. the emotional dimensions arousal and valence, and a third dimension describing liking (or sentiment), independently by several native speakers; German and Chinese: six annotators, Hungarian: five annotators. The annotation contours (traces) are combined into a single gold-standard using the same evaluator weighted estimator (EWE)-based approach that was used in the last two editions of AVEC (Ringeval et al., 2017b, 2018a). Table 3 shows the number of subjects and the duration of the recordings for each partition.
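To give an intuition for how several annotation traces can be fused into one gold standard, here is a minimal sketch of an evaluator-weighted average. It is not the exact AVEC procedure, which is described in the cited papers; it assumes a leave-one-out variant in which each annotator is weighted by the Pearson correlation between their trace and the mean of the other annotators' traces, and in which negative weights are clipped to zero. The function name `ewe_gold_standard` is hypothetical.

```python
import numpy as np

def ewe_gold_standard(traces):
    """Fuse annotation traces of shape (n_annotators, n_frames) into one trace.

    Assumption: weights are leave-one-out Pearson correlations, clipped at
    zero so that annotators who disagree with the rest do not contribute.
    Requires that no trace (or leave-one-out mean) is constant and that at
    least one weight is positive.
    """
    traces = np.asarray(traces, dtype=float)
    n_annotators = traces.shape[0]
    weights = np.empty(n_annotators)
    for k in range(n_annotators):
        others_mean = np.delete(traces, k, axis=0).mean(axis=0)
        weights[k] = np.corrcoef(traces[k], others_mean)[0, 1]
    weights = np.clip(weights, 0.0, None)
    return weights @ traces / weights.sum()
```

With this weighting, an annotator whose trace tracks the consensus closely has more influence on the gold standard than one whose trace is only loosely related to it.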

4. Baseline features

Emotion recognition from audiovisual signals usually relies on feature sets whose extraction is based on expertise gained over several decades of research in the domains of speech processing, e.g., Mel Frequency Cepstral Coefficients (MFCCs), and vision computing, e.g., Facial Action Units (FAUs). However, recent advances in the field of representation learning, whose objective is to learn representations of data that are best suited for the recognition task (Bengio et al., 2013), have shown that efficient representations of audiovisual signals can be learnt in the context of emotion (Trigeorgis et al., 2016; Schmitt et al., 2016; Amiriparian et al., 2017).

Audiovisual representations can be learnt from expert-driven information extracted from the raw signals (Schmitt et al., 2016), or directly from the raw signals (Trigeorgis et al., 2016). They can also be generated using adversarial networks (Deng et al., 2017), or using convolutional neural networks trained on out-of-domain data and for a different task, e.g., audio representations extracted by a model trained for object classification in images (Amiriparian et al., 2017).

4.1. Expert-knowledge

The traditional approach in affect sensing consists in summarising low-level descriptors (LLDs) of audiovisual signals over time with a set of statistical measures computed over a fixed-duration sliding analysis window. Those descriptors usually include spectral, cepstral, prosodic, and voice quality information for the audio channel, and appearance, geometric, and FAUs information for the video channel.

As audio features, we compute the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) (Eyben et al., 2016), which contains 88 measures covering the aforementioned acoustic dimensions and is used here as a baseline. In addition, MFCCs 1-13, including their 1st- and 2nd-order derivatives (deltas and double-deltas), are computed as a set of acoustic LLDs, using the openSMILE toolkit (http://audeering.com/technology/opensmile/) (Eyben et al., 2013). As visual features, we extract the intensities of 17 FAUs for each video frame, along with a confidence measure, using the OpenFace toolkit (https://github.com/TadasBaltrusaitis/OpenFace/) (Baltrušaitis et al., 2018). Descriptors of pose and gaze are additionally extracted.

Audiovisual LLDs are summarised over time by computing their mean and standard deviation using a sliding window of 4 s length, and a hop size of 1 s for the USoM and E-DAIC datasets, and 100 ms for the SEWA dataset, except for the eGeMAPS set, which is computed on each window.
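The windowed summarisation described above can be sketched as follows. This is a minimal illustration, not the challenge scripts: the function name `window_functionals`, the 100 frames-per-second LLD rate, and the choice to drop incomplete trailing windows are all assumptions made for the example.

```python
import numpy as np

def window_functionals(llds, frame_rate_hz, win_s=4.0, hop_s=1.0):
    """Mean and standard deviation of each LLD over a sliding window.

    llds: array of shape (n_frames, n_llds), sampled at frame_rate_hz.
    Returns an array of shape (n_windows, 2 * n_llds): means, then std-devs.
    Incomplete trailing windows are dropped (an assumption of this sketch).
    """
    win = int(round(win_s * frame_rate_hz))
    hop = int(round(hop_s * frame_rate_hz))
    starts = range(0, max(len(llds) - win, 0) + 1, hop)
    feats = [np.concatenate([llds[s:s + win].mean(axis=0),
                             llds[s:s + win].std(axis=0)]) for s in starts]
    return np.vstack(feats)

# Toy input: 10 s of 39-dimensional LLDs at a hypothetical 100 frames per second.
rng = np.random.default_rng(0)
llds = rng.normal(size=(1000, 39))
usom_like = window_functionals(llds, 100, hop_s=1.0)   # 1 s hop (USoM, E-DAIC)
sewa_like = window_functionals(llds, 100, hop_s=0.1)   # 100 ms hop (SEWA)
```

With a 4 s window, the shorter hop only changes how densely the summaries are sampled, not how much context each summary covers.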

4.2. Bags-of-Words

The technique of bags-of-words (BoW), which originates from text processing, represents the distribution of LLDs according to a dictionary learnt from them. As a front-end of the BoW, we use the MFCCs and the eGeMAPS set for the acoustic data, and the intensities of the FAUs for the video data; MFCCs and eGeMAPS LLDs are standardised (zero mean, unit variance) in an on-line approach prior to vector quantisation, while this step is not required for the FAU intensities.

To generate the BoW representations, both the acoustic and the visual features are processed and summarised over blocks of 4 s duration, with a step of 100 ms for the SEWA dataset, and 1 s for the USoM and E-DAIC datasets. The codebook size is $100$. Instances are sampled at random to build the dictionary, and the logarithm of the resulting term frequencies is taken in order to compress their range. The whole cross-modal BoW (XBoW) processing chain is executed using the open-source toolkit openXBOW (Schmitt and Schuller, 2017; https://github.com/openXBOW/openXBOW).
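The BoW steps (random codebook sampling, vector quantisation, log-compressed term frequencies) can be illustrated with a small NumPy sketch. It does not call openXBOW; the helper names, the offline standardisation of the toy data, and the `log(count + 1)` form of the compression are assumptions made for illustration.

```python
import numpy as np

def build_codebook(llds, size=100, seed=0):
    """Pick `size` LLD frames at random as codewords (random sampling, no clustering)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(llds), size=size, replace=False)
    return llds[idx]

def log_bow(block, codebook):
    """Log-compressed histogram of nearest-codeword assignments for one block."""
    # Squared Euclidean distance from every frame to every codeword.
    dists = ((block[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)
    counts = np.bincount(assignments, minlength=len(codebook))
    return np.log(counts + 1.0)  # assumed form of the log compression

rng = np.random.default_rng(1)
train_llds = rng.normal(size=(5000, 39))                            # toy LLDs
train_llds = (train_llds - train_llds.mean(0)) / train_llds.std(0)  # standardise (offline here)
codebook = build_codebook(train_llds, size=100)
block = train_llds[:400]            # one 4 s block at a hypothetical 100 frames per second
bow = log_bow(block, codebook)
```

Each block is thus mapped to a fixed 100-dimensional vector regardless of how many frames it contains.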

4.3. Deep Representations

As in last year’s challenge (Ringeval et al., 2018a), we have included Deep Spectrum features (https://github.com/DeepSpectrum/DeepSpectrum) as a deep learning based audio baseline feature representation (Amiriparian et al., 2017). Deep Spectrum features are inspired by deep representation learning paradigms common in image processing: spectral images of speech instances are fed into pre-trained image recognition CNNs and a set of the resulting activations is extracted as feature vectors.

For this year’s challenge, we extracted Deep Spectrum features from four robust pre-trained CNNs: VGG-16 (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al., 2012), DenseNet-121, and DenseNet-201 (Huang et al., 2017); AlexNet was used in the AVEC 2019 CES purely for consistency with the previous AVEC 2018 CES. The speech files are first transformed into mel-spectrogram images with 128 mel-frequency bands, a window width of 4 s for all challenge corpora and a hop size of 1 s for the USoM and E-DAIC datasets, and 100 ms for the SEWA dataset. Following that, the spectral-based images are forwarded through the pre-trained networks. A 4 096-dimensional feature vector is then formed from the activations of the second fully connected layer in VGG-16 and AlexNet, and a 1 024- and a 1 920-dimensional feature vector are obtained from the activations of the last average pooling layer of the DenseNet-121 and DenseNet-201 networks, respectively.

We also provide two baseline deep visual representations. For these, we employ a VGG-16 network (Simonyan and Zisserman, 2014) and a ResNet-50 network (He et al., 2016) that are pre-trained with the Affwild dataset (Kollias et al., 2019). The pipeline starts with applying the openFace toolkit (Baltrušaitis et al., 2018) to detect the face region and subsequently perform face alignment. Then, we freeze the weights of the two pre-trained models and feed the aligned face images to both CNNs individually. To obtain the deep representations for each frame, we extract the output of the first fully-connected layer from the pre-trained VGG-16 network, and the output of the global average pooling layer from the pre-trained ResNet-50 network. As a result, a 4 096-dimensional deep feature vector from VGG and a 2 048-dimensional deep feature vector from ResNet are provided for each frame.

5. Baseline systems

All baseline systems rely exclusively on existing open-source machine learning toolkits to ensure the reproducibility of the results. In this section, we describe the systems developed for each Sub-challenge, and present the obtained results. For evaluation on the test set, we retained the two audio representations with the best performance, and the two video representations with the best performance, in addition to the fusion of all audiovisual representations.

5.1. State-of-Mind Sub-challenge

We use a gated recurrent unit (GRU) network with two layers, each having 64 nodes for their hidden layers, for each audiovisual representation. As a pre-processing step, all input features are normalised to have zero mean and unit variance. Dropout, at a rate of 10 %, is employed during training. The GRU is then followed by a fully connected neural network that has one hidden layer with 32 nodes, followed by a single linear layer to map to the desired output size of one. Note that a middle-fusion of the audiovisual representations is performed by concatenating their respective GRU outputs.
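A possible PyTorch rendering of this architecture is sketched below. It is not the released baseline code: the class name `MidFusionGRU`, the use of the last GRU time step as the sequence summary, the ReLU activation, and the placement of dropout between the GRU layers are assumptions, and the input dimensions (88 and 17) are only illustrative.

```python
import torch
from torch import nn

class MidFusionGRU(nn.Module):
    """One 2-layer GRU per representation; their last outputs are concatenated."""

    def __init__(self, input_dims, hidden=64, fc_hidden=32, dropout=0.1):
        super().__init__()
        self.grus = nn.ModuleList([
            nn.GRU(d, hidden, num_layers=2, batch_first=True, dropout=dropout)
            for d in input_dims
        ])
        self.head = nn.Sequential(
            nn.Linear(hidden * len(input_dims), fc_hidden),  # hidden layer, 32 nodes
            nn.ReLU(),                                       # activation is an assumption
            nn.Linear(fc_hidden, 1),                         # single regression output
        )

    def forward(self, inputs):
        # inputs: list of tensors, each of shape (batch, time, feature_dim)
        last = [gru(x)[0][:, -1, :] for gru, x in zip(self.grus, inputs)]
        return self.head(torch.cat(last, dim=1)).squeeze(-1)

model = MidFusionGRU(input_dims=[88, 17])
audio = torch.randn(4, 30, 88)   # toy batch: 4 sequences, 30 steps
video = torch.randn(4, 30, 17)
pred = model([audio, video])     # one score per sequence
```

Passing a single-element list to `input_dims` gives the unimodal variant, so the same class covers both the per-representation and the fused systems.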

The model is implemented using the PyTorch framework and is trained with an Adam optimiser. As previous studies have shown the benefits of training a network following a curriculum (Bengio et al., 2009; Lotfian and Busso, 2019), where instances are gradually presented in increasing level of difficulty, we implemented this approach using the following strategy. First, a uniform distribution of valence labels is obtained by duplicating training instances. Then, a sub-set of the training set with only the data instances with $CA\in[2-3]\cup[9-10]$, i. e., the most positive and negative storytellings, is used first for training, followed by a larger sub-set with data instances with $CA\in[2-4]\cup[8-10]$, each for 32 epochs. We then exploited the whole training set until early stopping occurs; once 60 epochs have passed, training is stopped if there is no improvement within the last 25 epochs.
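The curriculum stages and the stopping rule can be expressed in a few lines of plain Python. This is a sketch under stated assumptions: the function names are hypothetical, the development score is taken to be "higher is better", and "no improvement" is read as the last 25 epochs not exceeding the best earlier score.

```python
def curriculum_stages(labels):
    """Index sub-sets for the three stages: extreme labels, wider range, all data."""
    stage1 = [i for i, ca in enumerate(labels) if 2 <= ca <= 3 or 9 <= ca <= 10]
    stage2 = [i for i, ca in enumerate(labels) if 2 <= ca <= 4 or 8 <= ca <= 10]
    stage3 = list(range(len(labels)))
    return [stage1, stage2, stage3]

def should_stop(dev_scores, min_epochs=60, patience=25):
    """True once min_epochs have passed and the last `patience` epochs
    brought no improvement over the best earlier development score."""
    epoch = len(dev_scores)
    if epoch < min_epochs or epoch <= patience:
        return False
    return max(dev_scores[-patience:]) <= max(dev_scores[:-patience])

labels = [2, 3, 4, 5, 6, 7, 8, 9, 10]          # toy core-affect labels
stages = curriculum_stages(labels)
flat_run = should_stop([0.1] * 100)            # plateau: stop
rising_run = should_stop(list(range(100)))     # still improving: continue
short_run = should_stop([0.1] * 50)            # fewer than 60 epochs: continue
```

In a training loop, the first two stages would each run for 32 epochs on their sub-set before `should_stop` is consulted on the full training set.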

Because the interest of the SoMS is in the analysis of a change in human SOM, the network is trained to model the difference between the self-reported core affect after each story and before the first story: $CA_{i}-CA_{0},i=1,2,3,4$. Results are reported for each audiovisual representation, and for the two training approaches, i. e., with or without curriculum, in Table 4. Whereas the mid-fusion of all audiovisual representations provides the best result on the development set for the two learning approaches, audio descriptors achieve higher performance on the test set, with the expert-based eGeMAPS set performing best with curriculum learning.

A summary of the results obtained with either a static ($CA_{i}$) or a dynamic ($CA_{i}-CA_{0}$) view of the self-reported mood used for training or testing the system is also provided in Table 5. Interestingly, results show that the automatic inference of the self-reported mood performs much better in a ‘mixed’ scenario, i. e., training on the static view ($CA_{i}$) and evaluating on the change ($CA_{i}-CA_{0}$), or vice versa, training on the change and testing on the static label, compared to a ‘consistent’ approach with both training and testing performed on the same view, i. e., either static or dynamic.

This result might stem from emotion data being hierarchically organised. As such, each self-reported emotion is nested within a person over a period of time (Koval et al., 2012). Because of humans’ inability to assess their own emotions as an absolute value, self-reported emotion can only be interpreted as a current assessment of emotional differences in relation to the nearest past. Furthermore, there is also variance in emotion dynamics between people and not only within a person (Koval et al., 2013). The inter-individual and intra-individual variance in emotion dynamics are strongly related to one another, but both add new information to predictions. While the variance between persons might be best captured in a scenario where machine learning is applied to raw values, the intra-individual auto-correlation of emotion, the so-called emotional inertia, is portrayed in the dynamic evaluation (Kuppens et al., 2010). Therefore, training on static data and evaluating on dynamic data, such as emotional inertia, might be the state-of-the-art approach to characterise human SOM.

5.2. Detecting Depression Sub-challenge

For the depression detection baseline, we employ a single-layer 64-d GRU as our recurrent network with a dropout regularisation of rate 20 %, followed by a 64-d fully-connected layer to obtain a single-value regression score. To handle bias, we convert the PHQ-8 score labels to floating point numbers by downscaling with a factor of 25 prior to training. The network is trained and evaluated using a $CCC$ loss function and evaluation score, and the $RMSE$ results are reported using the original PHQ scale. A batch size of 15 is used consistently, and the learning rate is optimised across different feature sets. In order for the data to fit in GPU memory, a maximum sequence length has been assigned for the sessions. For the MFCCs and eGeMAPS LLDs, and the high-dimensional deep representations like Deep Spectrum, ResNet, and VGG, a maximum sequence length of 20 minutes is used. Additionally, for the ResNet, VGG, and Deep Spectrum representations, frames are dropped, keeping one out of two or one out of four frames depending on the dimensionality, so that the data can be loaded into memory. Fusion of the different audiovisual representations is achieved by averaging their scores.
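The label scaling, score-averaging fusion, and reporting of $RMSE$ on the original PHQ scale can be sketched as follows. The helper names and the toy predictions are hypothetical; only the factor of 25 and the averaging of scores come from the description above.

```python
import numpy as np

PHQ_SCALE = 25.0  # downscaling factor stated in the text

def to_training_target(phq):
    """PHQ-8 scores mapped to small floating point targets for training."""
    return np.asarray(phq, dtype=float) / PHQ_SCALE

def to_phq(pred):
    """Network outputs mapped back to the original PHQ scale."""
    return np.asarray(pred, dtype=float) * PHQ_SCALE

def rmse(pred, gold):
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    return float(np.sqrt(np.mean((pred - gold) ** 2)))

def fuse(score_lists):
    """Late fusion: average per-session scores across representations."""
    return np.mean(np.asarray(score_lists, dtype=float), axis=0)

gold_phq = [0, 5, 10, 20]
audio_pred = [0.04, 0.16, 0.36, 0.72]   # toy outputs on the downscaled range
video_pred = [0.00, 0.24, 0.44, 0.80]
fused_phq = to_phq(fuse([audio_pred, video_pred]))
error = rmse(fused_phq, gold_phq)       # RMSE on the original PHQ scale
```

Rescaling before computing the error matters: the same predictions evaluated on the downscaled targets would yield an $RMSE$ smaller by the factor of 25.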

Baseline results of the DDS are given in Table 6. They show that, on the development set, the best $CCC$ score from audio features was achieved with Deep Spectrum (DS-VGG) features, and the model with ResNet features achieved the best result for visual features. These results indicate the power of representations learnt by deep neural networks with a large amount of data when being used in a context different from the one for which they were initially designed, which is confirmed on the test set with the ResNet visual model achieving the best result, despite a relatively low $CCC$.

Fusion of the different representations achieves the best result on the development set, and the $RMSE$ returned on the test set is slightly better than the one obtained on the DAIC-WoZ dataset with the AVEC 2017 baseline system (Ringeval et al., 2017b); $RMSE=6.37$ for AVEC 2019 compared to $RMSE=6.97$ for AVEC 2017. However, the baseline system developed for this year’s Challenge is more complex (a simple linear regression model for AVEC 2017 versus GRU-RNNs for this year), and the corresponding scores should therefore be best regarded in the light of the best result of the AVEC 2017 Depression Sub-challenge (Gong and Poellabauer, 2017), which was $RMSE=4.99$.

On the basis of the results obtained in the automatic sensing of the level of depression from interactions with the virtual agent, recognition seems more challenging when the agent is solely AI-driven than when a human is driving the agent as a WoZ. This observation opens interesting research questions for designing the agent in a way that the observation of depression cues can be maximised, e. g., by reinforcement learning, according to the interaction style of the agent.

5.3. Cross-cultural Emotion Sub-challenge

For the baseline system of the CES, we employ a 2-layer LSTM-RNN (64 / 32 units) as a time-dependent regressor of the three targets (learnt together) for each representation of the audiovisual signals, and SVMs (liblinear with the L2-L2 dual form of the objective function) for the late fusion of the predictions. The model is implemented using the Keras framework. The network is trained for 50 epochs with the RMSprop optimiser using a dropout rate of 10 %, and the model providing the highest $CCC$ on the development set of the German and Hungarian cultures is used to generate the predictions for the test sets (German, Hungarian, and all clips of the Chinese culture). Even though the model has three outputs modelling each dimension, the optimum model for each dimension is selected separately. The predictions of all test sequences from each culture are concatenated prior to computing the $CCC$, whose negative is used as the loss function for training the networks (Trigeorgis et al., 2016; Weninger et al., 2016).
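For reference, the concordance correlation coefficient and its use over concatenated sequences can be written in NumPy as below. The function names are hypothetical and the sequences are toy data; the formula is the standard definition of the $CCC$, computed here with population (biased) variances.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D series."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)

def ccc_loss(pred, gold):
    """Negative CCC, so that minimising the loss maximises concordance."""
    return -ccc(pred, gold)

# Sequences of one culture are concatenated before the score is computed.
gold_seqs = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
pred_seqs = [g + 1.0 for g in gold_seqs]          # toy predictions with a constant bias
gold_all = np.concatenate(gold_seqs)
pred_all = np.concatenate(pred_seqs)
score = ccc(pred_all, gold_all)
```

Unlike Pearson's correlation, which would be 1 for the biased toy predictions above, the $CCC$ penalises the constant offset through the squared difference of the means.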

In order to perform time-continuous prediction of the emotional dimensions, audiovisual signals were processed with a sliding window of 4 s length, which is a compromise to capture enough information to be used with both static regressors, such as SVMs, and context-aware regressors, such as RNNs. We utilised frame-stacking for the SVM-based late fusion of the audiovisual representations with either past, or future context.
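Frame-stacking with past or future context can be illustrated as follows. The function name `stack_frames`, the column ordering, and the edge-padding at sequence boundaries are assumptions of this sketch; the amount of context used in the baseline is not specified here.

```python
import numpy as np

def stack_frames(preds, context, direction="past"):
    """Append `context` neighbouring frames to each frame (edge-padded).

    preds: array of shape (n_frames, n_dims).
    Returns an array of shape (n_frames, (context + 1) * n_dims), ordered as
    current frame first, then frames at increasing distance.
    """
    preds = np.asarray(preds, dtype=float)
    n = len(preds)
    if direction == "past":
        pad = np.repeat(preds[:1], context, axis=0)
        padded = np.concatenate([pad, preds])
        cols = [padded[context - k: context - k + n] for k in range(context + 1)]
    elif direction == "future":
        pad = np.repeat(preds[-1:], context, axis=0)
        padded = np.concatenate([preds, pad])
        cols = [padded[k: k + n] for k in range(context + 1)]
    else:
        raise ValueError("direction must be 'past' or 'future'")
    return np.hstack(cols)

preds = np.arange(5).reshape(5, 1)            # toy unimodal predictions
past = stack_frames(preds, 2, "past")
future = stack_frames(preds, 2, "future")
```

The stacked rows can then be fed to a static regressor such as a linear SVM, which would otherwise see each frame in isolation.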

Baseline results of the CES are given in Table 7. They show improvements over the performance reported in the previous edition of the AVEC CES; relative improvement for German is 7.25% and 8.25% for arousal and valence, respectively, and for Hungarian, 17.3% and 13.3%, respectively. The inclusion of instances of the Hungarian culture as training and development material, in addition to those of the German culture, might explain the large increase in performance for both cultures, as only instances of the German culture were available for training and development in the AVEC 2018 CES. In addition, a more recent version of the openFace toolkit (Baltrušaitis et al., 2018) was exploited, which provided the best results on the test set for both arousal and valence with FAU-based features. Those results confirm the common view that facial expressions of emotion have a large degree of universality across cultures, compared to the vocal expressions, where the acoustic and prosodic dimensions already play a key role in the oral communication by serving many grammatical and pragmatic functionalities, e. g., in tonal languages like Mandarin, the meaning associated with a syllable depends on its pitch contour. Such language-dependent peculiarities make cross-cultural settings highly challenging, especially when noise comes into play because of the ecological conditions of study.

6. Conclusions

In this paper, we introduced AVEC 2019 – the sixth combined open Audio/Visual Emotion and Health assessment challenge. It comprises three Sub-challenges: i) SoMS, where the level of mood has to be predicted from positive and negative personal stories; ii) DDS, where the level of depression (PHQ-8 score) has to be predicted from structured interviews conducted by a virtual agent wholly driven by AI; and, iii) CES, where the level of affective dimensions of arousal, valence, and liking has to be inferred in a cross-cultural in-the-wild paradigm with German and Hungarian cultures as training and testing material, and Chinese culture as solely testing material.

By intention, we opted to use exclusively open-source software and the highest possible transparency and realism for the baselines, by using the same number of trials as given to participants for reporting results on the test partition, and sharing all the developed scripts for both feature extraction and machine learning on a public platform. Results indicate that: i) in the SoMS, the level of mood was best predicted when the system was trained on the static scores and evaluated on their dynamic view, i. e., between the label provided after the storytellings, and before the first story, which can be explained by inertial emotion theories; ii) in the DDS, prediction of the level of depression (PHQ-8) is reported to be more challenging when the virtual agent conducting the interview is wholly driven by AI, compared to a WoZ setup; and iii) in the CES, dimensional emotions are more challenging to sense in a cross-cultural setting for audio descriptors compared to video descriptors, which confirms on the one hand the universality of facial expressions for Asian (Chinese) and Western European cultures (German and Hungarian), and shows on the other the challenge of using audio descriptors for paralinguistic analysis in languages presenting dissimilarities in their acoustics, in particular when data are collected in an ecological (noisy) environment.

Acknowledgements.

The research leading to these results has received funding from the Horizon 2020 Programme through the Research Innovation Action No. 826506 (sustAGE), and No. 688835 (DE-ENIGMA). Further funding has also been received from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 115902, which receives support from the European Union’s Horizon 2020 research and innovation program and EFPIA. The work on the DDS was supported in part by the U.S. Army. Any opinion, content or information presented does not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred. The authors further thank the sponsor of the challenge – audEERING GmbH.

Bibliography

Althoff et al. (2016) Tim Althoff, Kevin Clark, and Jure Leskovec. 2016. Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. Transactions of the Association for Computational Linguistics 4 (2016), 463–476.
Amiriparian et al. (2017) Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, Alice Baird, and Björn Schuller. 2017. Snore sound classification using image-based deep spectrum features. In Proc. of INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association. ISCA, Stockholm, Sweden, 3512–3516.
Association (2013) American Psychiatric Association. 2013. Diagnostic and Statistical Manual of Mental Disorders (DSM-5). American Psychiatric Publishing, Arlington, VA.
Baltrušaitis et al. (2018) Tadas Baltrušaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Proc. 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, Xi’an, P. R. China, 59–66.
Baumel and Yom-Tov (2018) Amit Baumel and Elad Yom-Tov. 2018. Predicting user adherence to behavioral eHealth interventions in the real world: examining which aspects of intervention design matter most. Translational Behavioral Medicine 8, 5 (2018), 793–798.
Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 4 (August 2013), 1798–1828.
Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum Learning. In Proc. International Conference on Machine Learning (ICML). ACM, Montreal, QC, Canada, 41–48.