Towards Emotion Retrieval in Egocentric PhotoStream
Estefania Talavera, Petia Radeva, Nicolai Petkov

TL;DR
This paper introduces a method for sentiment analysis of egocentric photo streams, aiming to identify positive, neutral, or negative events to enhance emotion retrieval from wearable camera data.
Contribution
It presents a novel approach for assigning sentiment to events in egocentric photostreams, advancing emotion recognition in wearable camera data analysis.
Findings
Achieved 75% classification accuracy on test data.
Demonstrated the feasibility of sentiment recognition in egocentric images.
Opened new avenues for emotion retrieval in wearable camera research.
Abstract
The availability and use of egocentric data are rapidly increasing due to the growing use of wearable cameras. Our aim is to study the effect (positive, neutral or negative) of egocentric images or events on an observer. Given egocentric photostreams capturing the wearer's days, we propose a method that aims to assign sentiment to events extracted from egocentric photostreams. Such moments can be candidates to retrieve according to their possibility of representing a positive experience for the camera's wearer. The proposed approach obtained a classification accuracy of 75% on the test set, with a deviation of 8%. Our model makes a step forward opening the door to sentiment recognition in egocentric photostreams.
Click any figure to enlarge with its caption.
Figure 2| Positive | Neutral | Negative | ||||||
|---|---|---|---|---|---|---|---|---|
| petals | christmas | award | car | study | bible | tumb | bug | nightmare |
| rose | winter | present | cars | science | book | tumbstone | bugs | accident |
| flora | snow | honor | machine | history | card | monument | insect | shadows |
| park | santa | gift | vehicle | economy | stiletto | grave | worm | noise |
| yard | sketch | heroes | rally | market | sins | memorial | cockroach | scream |
| plant | cartoon | dolls | train | industry | record | stone | decay | night |
| garden | drawing | dolls | competition | statue | paper | graveyard | garbage | darkness |
| comics | toy | race | sculpture | poem | cementery | trash | shadow | |
| illustration | toys | control | museum | interview | grief | shit | ||
| humor | lego | metal | pain | |||||
| Accuracy | F-Score | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|||||||||||||
| Ours | 0.60 | 0.63 | 0.73 | 0.35 | 0.43 | 0.59 | ||||||||||||
|
0.68 | 0.66 | 0.68 | 0.48 | 0.45 | 0.48 | ||||||||||||
|
0.65 | 0.65 | 0.66 | 0.41 | 0.43 | 0.47 | ||||||||||||
| Accuracy | F-Score | |||
|---|---|---|---|---|
|
||||
| Ours | 0.750.08 | 0.600.13 | ||
|
0.690.1 | 0.500.15 | ||
|
0.740.1 | 0.580.15 | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
11institutetext: University of Barcelona, Spain, and University of Groningen, The Netherlands
[email protected],[email protected],[email protected]
Towards Egocentric Sentiment Analysis
Estefania Talavera
Petia Radeva
Nicolai Petkov
Abstract
The availability and use of egocentric data are rapidly increasing due to the growing use of wearable cameras. Our aim is to study the effect (positive, neutral or negative) of egocentric images or events on an observer. Given egocentric photostreams capturing the wearer’s days, we propose a method that aims to assign sentiment to events extracted from egocentric photostreams. Such moments can be candidates to retrieve according to their possibility of representing a positive experience for the camera’s wearer. The proposed approach obtained a classification accuracy of 75% on the test set, with deviation of 8%. Our model makes a step forward opening the door to sentiment recognition in egocentric photostreams.
Keywords:
egocentric images, moment retrieval, sentiment analysis
1 Introduction
Lifelogging describes an egocentric vision of the experiences of a person. Nowadays, the use of small wearable cameras, which capture images in certain intervals, is increasing considerably. Such images provide an overview of the daily activities of a person that can be interpreted as a visual log of the day. This information can be used to examine a person’s pattern of behaviour; daily habits, such as eating habits, social interactions, indoor or outdoor activities, are recorded in such images. Although our mood is influenced by the environment and social context that surrounds us, egocentric data do not always catch our attention or induce the same emotion when retrieved. We consider that the creation of a diary of positive moments in an electronic way, by combining several cues, will help to improve the inner perception of the user’s own life. Therefore, in this work we seek for positive moments that can raise the user’s positiveness.
Several experiments have been conducted in that direction, in [14] the authors presented a survey of positive psychology strategies that demonstrated to be potentially effective as tools for the treatment of depression. As an example, in [1] the authors suggested to the participants activities such as walking in a park, visiting a friend, or going to the Student Union and saying hello to someone, that resulted in participants’ suffering to decline or disappear. Their study gives ideas about the type of moments that can retrieve positive feelings; nature, landscapes, friends, or smiling people staring at us, among others.
Sentiment analysis from images is a novel research field. Given the challenge of image sentiment recognition and the ambiguity of the problem, we analyse images sentiment assigning a discrete ternary sentiment value (positive (1), neutral (0) or negative (-1) value), similar to [18]. In the literature, sentiment recognition from images has been approached based on different types of data features. Attributes such as facial expressions are used for sentiment prediction in [8, 20]. The combination of visual and textual information from the images [16, 17] appeared due to the wide use of online social media and microblogs. In such websites, images are posted with short descriptive comments by the user. Audio features were also included in the models presented by [11, 13].
Despite several works having approached the understanding of how people can be affected seeing images, the field of sentiment analysis from images has not been yet settled. The labelling of the images is not yet established, leading to different questions when addressing sentiment recognition from images. As examples, we can find available datasets labelled as: amusement, anger, awe, disgust, excitement, fear, sad [10, 19], Positive/Negative [2]-Twitter, Positive/Negative/Neutral[5], with values of Pleasure, Arousal and Dominance [7], with sentiment values from -2 to 2[2], where the extremes correspond to Negative and Positive sentiments, respectively. Despite the above mentioned works, to the best of our knowledge, none of them has dealt with the sentiment recognition from egocentric photostreams.
Recently, with the outstanding performance of the Convolutional Neural Networks (CNN), several approaches on sentiment analysis have relied on supervised learning through deep learning techniques, such as [3, 8, 9, 19]. One of the more remarkable approaches, in [2], introduced a Visual Sentiment Ontology (VSO), based on the Plutchik’s wheel of emotions [12], and a visual concept detector called SentiBank. The VSO is built by 3022 semantic concepts called Adjective Noun Pairs (ANP) represented by images from the social net Flickr with them as tags. The ANPs are composed by a pair of a noun and an adjective, with a sentiment value associated between [-2 : 2]. They defend that an object, according to its appearance has a different sentiment value associated to it, like ’lonely boat’ (-1.43) and ’traditional boat’ (1,37), or ’noisy bird’ (-1) and ’cute bird’ (1,37). They proposed the semantic concepts baseline classification based on visual features (RGB, SIFT, LBP, etc) extracted from the images. In [4], a Deep Neural Network named DeepSentiBank was trained on Caffe for the VSO semantic concepts classification. The authors relied on the concepts with higher number of images and with a classification accuracy associated. From the original 3022 concepts in [2] they select 2089 in [4].
To the best of our knowledge, previous to our work [15] there was no approaches addressing sentiment recognition from egocentric photostreams. We propose to analyse the output of the DeepSentiBank per image as semantic representation. We defined a classification model where the one-vs-all SVM classifiers were trained and evaluated with the features describing semantic and global information from the images.
In this work, we approach the same problem from a different perspective. We analyse the relation to each other semantic concepts extracted from images that belong to the same scene. A scene is described as a group of sequential images related between them and describing the same event. Our contribution is an analytic tool for positive emotion retrieval seeking for events that best represent a positive moment to be retrieved within the whole set of a day photostream. We focus on the event’s sentiment description where we are observers without inner information about the event, i.e. from an objective point of view of the moment under analysis.
The rest of the paper is organized as follows. In Sect. 2, we describe the sentiment analysis method and the features selection procedure. In Sect. 3, we describe the proposed dataset, while in Sect. 4, we describe the experimental setup, the evaluation, and discuss our findings. Finally, Sect. 5 draws conclusions and outlines future lines of work.
2 Method
Given an egocentric photostream, we propose scene emotion analysis seeking for events that represent and can retrieve a positive feeling from the user. We apply event-based analysis since single egocentric images cannot capture the whole essence of the situation. By combining information from several images that represent the same scene, we get closer to a better understanding of the event.
2.1 Temporal segmentation:
We apply temporal segmentation on the egocentric photostreams using the proposed method in [6]. The clustering procedure is performed on an image representation that combines visual features extracted by a CNN with semantic features in terms of visual concepts extracted by Imagga’s auto-tagging technology111http://www.imagga.com/solutions/auto-tagging.html. In Fig. 1 we present some examples of events extracted from the dataset, we introduce below.
2.2 Event’s sentiment recognition:
The model relies on semantic concepts extracted from the images to infer the event sentiment associated. However, it relies not only on the semantic concepts extracted by the net with their sentiment associated, but also on how those semantic concepts can be interpreted by the user. We apply the DeepSentiBank Convolutional Neural Network[4] to extract the images semantic information since it is the only introduced model that extract semantic concepts (ANPs) with sentiment values associated. Given an image, the output of the network is a 2089-D feature vector, where the values correspond to the ANPs likelihood in the image.
Besides taking into account the sentiment associated to the ANPs, the influence of the common concepts within an event are also analysed. We categorize the noun into Positive, Neutral or Negative. There is a wide range of semantic concepts within the ontology, but many of them seem to repeat concepts that even from the user perspective would be difficult to differentiate when looking at an image; such as ”girl” from ”woman” or ”lady”.
When facing our egocentric images challenge, the VSO presents several drawbacks. On one hand, this tool is trained to recognise up to 2089 concepts, which can not describe all possible scenarios. On the other hand, despite including that big amount of concepts, many of them categorize objects into categories difficult to visually interpret or differ by the human eye. Examples can be the distinction between ’child’, ’children’, ’boy’, or ’kid’ from an image. In order to overcome this problem, we generate a parallel ontology with what we consider an egocentric view of the concepts, i.e., we cluster the concepts a person would merge based on their semantic.
Egocentric analysis of the VSO: We cluster the semantic concepts based on the similarities between the noun components of the ANPs, which are computed using the wordNet tool222 http://wordnet.princeton.edu. Following what would be considered as similar from an egocentric point of view, we manually refine the resulted clusters into 44 categories. We label the clusters as Positive, Neutral or Negative. In Table 1 we present some of the egosemantic clusters.
2.3 Sentiment Model:
Given an event, the event’s sentiment analysis model (see Fig. 2) performs as follows;
Given the ego-photostream we apply the temporal segmentation, analyse events with a minimum of 6 images, i.e. that last for at least 3 minutes. 2. 2.
Extract the ANPs of each event frame and rank them by their probability () of describing an image. 3. 3.
Select the top-5 ANPs per image, since we consider that those are the concepts with higher relevance, thus better capturing the image’s information. After this step the model ends up with a total of M semantic concepts per event, where Number of images . 4. 4.
Cluster the semantic concepts based on their Wordnet-based nouns semantic distances. As a result, we have clusters of concepts with semantic similarity. For the event sentiment computation (), focus on the largest cluster. 5. 5.
Finally, fuse the sentiment associated to the ANPs and noun’s cluster following the eq. (1):
[TABLE]
where is the ANP’s sentiment given by the VSO and is the label of the noun, and are the contributions (%) of the ANPs and the nouns. Take into account the probability associated to the ANPs aiming to penalize the ANPs with low relation to the image content.
3 Experiments Setup
3.1 Dataset:
We collected a dataset of 4495 egocentric pictures, which we call UBRUG-Senti. The user was asked to wear the Narrative Clip Camera333http://getnarrative.com/ fixed to his/her chest during several hours every day and was asked to continue with his/her normal life. Since the camera is attached to the chest, the frames vary following the user’s movement and describe the user’s view of his/her daily indoor/outdoor activities. It involves challenging backgrounds due to the scene variation, handled objects appearing and disappearing during images sequences, and the movement of the user. The camera takes a picture every 30 seconds, hence each day around 1500 images are collected for processing. The images have a resolution of 5MP and JPG format.
After the temporal clustering [6], we obtained a dataset composed of 4495 images grouped in a total of 98 events. The events were manually labelled based on how the user felt while reviewing them. The labels assigned were Positive (36), Negative (43) and Neutral (19). Some examples are given in Fig 1.
3.2 Experiments:
During the experimental phase, we evaluated the contributions of ANPs and nouns by defining different combinations of and . We performed a balanced 5-fold cross validation. For each of the folds, we used 80% of the total of events per label of our dataset and compute the best pair of and values. This is a parameters selection process that is later re-evaluated in a test phase with a different set of events.
Validation: To evaluate the effectiveness of the scene detection approach, we use the Accuracy, as the rate of correct results, and the F-Score (F1). The F1 is defined as : , where is the precision , is the recall and , and respectively are the number of true positives, false positives and false negatives of the event’s sentiment label correctly identified.
Results: Tables 2 and 3 present the results achieved by the proposed method at image and event level, respectively. The model achieves an average training accuracy of 733.8% and F-score of 595.4% and test accuracy of 758.2% and F-score of 6113.2%, when and , i.e. when the ANP information is considered; although the major contribution comes from the noun sentiment associated. As expected, neutral events are the most challenging ones to classify.
In order to contextualize our results, we fine-tune the well-known GoogleNet deep convolutional neural network [9] to classify into Positive, Neutral and Negative. We use 80%, 10% and 10% of the dataset for training, validation and testing respectively. The network achieves an accuracy of 55%.
From the results we can conclude that the application of the DeepSentiBank presents drawbacks when applied to egocentric photostreams. To begin with and as commented before, the 2089 ANPs not necessarily have the power to represent what the image captured about the scene, taking into account the difficulty to detect them automatically (Mean average accuracy of the net 25%). Moreover, the ANPs present the limitation that they are classified strictly into Negative or Positive concepts. Thus, moments from our daily routine, which are often considered as neutral, are difficult to recognize.
4 Conclusions
We present a new model for positive moments recognition from our digital memory, composed by images recorded by the Narrative wearable camera. It analyses semantic concepts called ANPs extracted from the images. These semantic concepts have a sentiment value associated and describe the appearance of concepts in the images. The sentiment prediction tool is based on new semantic distance of ANPs and fusion of ANPs and nouns sentiments extracted from egocentric photostreams. The proposed approach obtained a classification accuracy of 75% on the test set, with deviation of 8%. Future experiments will address the generalization of the model over datasets collected by other wearable cameras, as well as recorded by different users. Analysing the results obtained, we conclude that the polarity of the ANPs makes it difficult to classify ’Neutral’ events. However, most of our daily life is composed by neutral events, which can be considered as routine. Thus, in future lines we will address the routine recognition and retrieval.
Acknowledgements: This work was partially founded by Ministerio de Ciencia e Innovación of the Gobierno de España, through the research project TIN2015-66951-C2. SGR 1219, CERCA, ICREA Academia 2014 and Grant 20141510 (Marató TV3). The funders had no role in the study design, data collection, analysis, and preparation of the manuscript.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] J. Beck and et al. Stimulating Therapeutic Change with Interpretations: A comparison of positive and negative connotation. Counseling Psychology , 1986.
- 2[2] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. ACM , pages 223–232, 2013.
- 3[3] V. Campos and et al. Diving Deep into Sentiment: Understanding Fine-tuned CN Ns for Visual Sentiment Prediction. ASM , pages 57–62, 2015.
- 4[4] T. Chen, D. Borth, T. Darrell, and S.-F. Chang. Deep Senti Bank: Visual Sentiment Concept Classification with Deep Convolutional Neural Networks. page 7, 2014.
- 5[5] E. S. Dan-Glauser and K. R. Scherer. The Geneva affective picture database (GAPED): a new 730-picture database focusing on valence and normative significance. Behavior research methods , 43(2):468–77, 2011.
- 6[6] M. Dimiccoli, E. Talavera, S. G. Nikolov, and P. Radeva. SR-Clustering: Semantic Regularized Clustering for Egocentric Photo Streams Segmentation. 2015.
- 7[7] P. Lang, M. Bradley, and B. Cuthbert. International Affective Picture System (IAPS): Technical Manual and Affective Ratings. NIMH , pages 39–58, 1997.
- 8[8] G. Levi and T. Hassner. Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns. ICMI , pages 503–510, 2015.
